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1 .  Separability  -  An  Approach  to  Physical 
Database  Design 

1.1  Introduction 

Problems  of  access  path  selection  in  large  integrated  databases  can  be  approached  from  two  standpoints. 
Query  optimization  seeks  the  optimal  selection  of  access  paths  for  a  specific  query  being  processed -given  a 
certain  structure  of  fhe  underlying  physical  database  (SMI  75]  [PEC  75]  [GOT  75]  [BLA  76]  [YAO  79]  [SEL 
79].  On  the  other  hand,  design  of  a  physical  database  is  concerned  with  the  optimal  configuration  of  physical 
file  and  ;  cccss  structures- given  the  logical  access  paths  that  represent  the  interconnections  among  objects  in 
the  data  model  ;  the  usage  patterns  of  those  paths;  the  organizational  characteristics  of  the  data  stored  in  the 
files;  the  various  features  of  the  particular  DBMS  such  as  available  access  structures  (indexes,  links,  hashed 
organization,  clustering  of  records,  etc.)  [HSI  70]  [CAR  75]  [SCH  75]  [SEV  75]  [HAM  76]  [YAO  77]  [BAT  80]. 
Throughout  this  paper  we  use  the  term  access  configuration  to  mean  the  aggregate  of  access  structures 
assigned  to  a  relation  or  to  the  whole  database. 

Most  past  research  directed  toward  optimal  design  of  physical  databases  has  concentrated  on  single-file 
cases.  This  research  must  be  extended  to  the  design  of  the  access  configuration  of  multifile  databases. 
Although  some  efforts  have  been  devoted  to  multifile  cases  [GAM  77]  [BAT  8Q][KAT  80],  the  approaches 
employed  fall  far  short  of  accomplishing  automatic  design  of  optimal  physical  databases. 

In  this  paper  we  discuss  the  issues  involved  in  designing  the  access  coi  r  oration  of  a  physical  database  so 
as  to  minimize  the  number  of  disk  accesses  for  queries  and  updates.  Our  approach  is  somewhat  formal  and 
mathematical,  deliberately  avoiding  excessive  reliance  on  heuristics.  Our  purpose  is  to  render  the  whole 
design  phase  manageable  andto  facilitate  understanding  of  the  underlying  mechanisms. 

By  analyzing  an  important  set  of  join  methods  possessing  the  property  we  call  separability ,  we  shall  prove 
that  optimal  design  of  the  access  configuration  of  a  multifile  database  can  be  reduced  to  the  collective  optimal 
designs  of  individual  relations.  In  this  paper  we  restrict  the  available  join  methods  to  this  set  to  make  the 
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whole  approach  formally  manageable.  Extensions  to  other  join  methods  will  be  mentioned  briefly.  The  main 
idea  is  to  set  up  a  basic  design  methodology  in  accordance  with  a  formal  method  that  includes  a  large  subset  of 
practically  important  join  methods,  and  then,  using  some  straightforward  heuristics,  extend  this  basic  design 
methodology  to  include  other  join  methods  as  well. 

Section  1.2  introduces  several  key  assumptions,  while  Section  1.3  describes  applicable  join  methods  of 
interest.  In  Section  1.5,  the  design  theory  will  be  developed  by  using  the  simple  cost  model  introduced  for  the 
examples  in  Section  1.4.  A  design  algorithm  based  on  the  theory  will  be  introduced  in  Section  1.6. 
Extensions  of  our  approach  are  mentioned,  briefly  in  Section  1.7. 

1.2  Approaches  and  Assumptions 

The  design  of  an  optimal  physical  database  is  complex  for  a  number  of  reasons -two  of  which  we  shall 
discuss  here.  First,  we  may  have  several  types  of  access  structures  available  as  options.  Although  some 
generalized  formulas  for  determining  access  cost  have  been  devised  for  certain  kinds  of  file  structures  [HSI 70] 
[SEV  75]  [YAO  77]  [BAT  80],  it  is  generally  difficult  to  use  them  for  the  selection  of  optimal  file  structures 
without  an  exhaustive  search  among  all  possible  alternatives.  It  therefore  becomes  necessary  to  accomplish  a 
judicious  separation  of  design  steps  and  to  develop  interfaces  that  will  minimize  interactions  among  those 
steps. 

The  second  source  of  complexity  addressed  is  the  interaction  among  the  access  structures  assigned  to 
different  relations.  There  are  various  techniques  available,  especially  join  methods,  for  processing  a  query  — 
and  the  choice  frequently  depends  on  the  access  structures  available  on  more  than  one  relation.  Therefore, 
the  processing  cost  of  a  query  associated  with  one  relation  depends  upon  other  interacting  relations.  It  is  the 
purpose  of  this  paper  to  provide  a  mechanism  for  coping  with  these  interactions  during  the  design  phase. 

We  choose  a  relational  DBMS  and  start  with  the  indexes  and  the  clustering  property  of  a  single  relation  as 
the  initially  available  access  structures.  The  link  structure  [BLA  76]  will  be  included  as  an  extension  of  the 
basic  result  by  using  heuristics.  Gustcring  of  two  or  more  relations,  as  in  many  hierarchical  organizations,  is 
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not  considered.  Wc  also  assume  that  all  TID  (tuple  identifier)  manipulations  can  be  performed  in  the  main 
memory  without  any  need  to  perform  I/O  accesses. 

The  database  is  assumed  to  reside  on  disklike  devices.  Physical  storage  space  for  the  database  is  divided 
into  units  of  fixed  size  called  blocks  [WIE  77].  The  block  is  not  only  the  unit  of  disk  allocation,  but  is  also  the 
unit  of  transfer  between  main  memory  and  disk.  We  assume  that  a  block  that  contains  tuples  of  a  relation 
contains  only  the  tuples  of  that  relation.  Furthermore,  we  assume  that  the  blocks  containing  tuples  of  a 
relation,  which  comprises  a  file,  can  be  accessed  serially.  However,  the  blocks  do  not  have  to  be  contiguous 
on  the  disk.2 

In  principle,  we  assume  that  a  relation  is  mapped  into  a  single  file.  Accordingly,  from  now  on,  wc  shall  use 
the  terms  file  and  relation  interchangably.  This  does  not  mean,  however,  that  we  exclude  the  possibility  of 
storing  prejoined  forms  of  relations  directly  in  the  physical  database.  We  believe  this  can  be  considered  in  a 
separate  refining  phase  after  the  basic  design  has  been  obtained. 

We  shall  develop  a  simple  cost  model  of  the  storage  structure  in  Section  1.4,  and  shall  use  various  cost 
formulas  based  on  this  model.  For  convenience,  we  assume  that  the  size  of  the  available  buffer  is  one  block. 
However,  the  theory  we  develop  is  not  dependent  on  the  buffer  size,  if  we  ignore  the  contention  among  many 
transactions  in  the  buffer  pool  at  query-processing  time.  Not  cncorporated  in  our  theory  are  either  the  effect 
of  the  contention  in  the  buffer  pool  and  the  scheduling  algorithm. 

We  consider  only  one-to-many  (including  one-to-one)  relationships  between  relations.  It  is  argued  in 
Appendix  A  that  many-to-many  relationships  between  relations  are  less  important  for  the  optimization.  Note 
that  here  wc  arc  dealing  with  relationships  in  relational  representations,  so  that  a  relationship  among  distinct 
entity  sets  at  the  conceptual  level  is  often  structured  with  an  additional  intermediate  relation. 

Finally,  we  are  considering  only  one-variable  or  two-variable  queries  in  this  paper.  For  a  query  of  more 


2 


For  example,  blocks  of  a  Pile  can  be  spread  all  over  the  disk  while  they  are  connected  as  a  linked  list  or  linked  implicitly  by  a  file  map. 


-5- 


SEPARABILITY  AS  A  PHYSICAL  DATABASE  DESIGN  METHODOLOGY 


than  two  variables,  a  heuristic  approach  can  be  employed  to  decompose  it  into  a  sequence  of  two-variable 
queries  (These  correspond  to  one-overlapping  queries  in  [WON  76]). 

1.3  Query  Evaluation 

The  class  of  queries  we  shall  be  considering  is  shown  in  Figure  1-1.  The  conceptual  meaning  of  this  class  of 
queries  is  as  follows.  Tuples  in  relation  Ri  are  restricted  by  restriction  predicate  Pr  Likewise  tuples  in 
relation  R2  are  restricted  by  predicate  P2.  The  resulting  tuples  from  each  relation  are  joined  according  to  the 
join  predicate  RrA  =  R2.B,  and  the  result  projected  over  the  columns  ar..an.  We  shall  call  the  columns  that 
are  involved  in  the  restriction  predicates  restriction  columns ,  and  those  in  the  join  predicate  join  columns .  The 
actual  implementation  of  this  class  of  queries  does  not  have  to  follow  the  order  specified  above  as  long  as  it 
produces  the  same  result 

|  restriction 
j  predicate  ?x 

i  i 

|  R,  |  JOIN 

|  |  Rj .  A  =  RZ.B 


ai • a2  ’  ' ' a« 

Figure  1*  1 :  General  Class  of  Queries  to  be  Considered. 

Query  evaluation  algorithms,  especially  for  two-variable  queries,  have  been  studied  in  [BLA  76]  and  [YAO 
79].  The  algorithms  for  evaluating  queries  differ  significantly  in  the  way  they  use  join  methods.  Before 
discussing  the  various  join  methods,  let  us  define  some  terminology. 

Given  a  query,  an  index  is  called  a  join  index  if  it  is  defined  for  the  join  column  of  a  relation.  Likewise,  an 
index  is  called  a  restriction  index  if  it  is  defined  for  a  restriction  column.  We  shall  use  the  term  subtuple  for  a 
tuple  that  has  been  projected  over  some  columns.  The  restriction  predicate  in  a  query  for  each  relation  is 
decomposed  into  the  form  Q1  A  Q2,  where  Q1  is  a  predicate  that  can  be  processed  by  using  indexes  while  Q2 


|  restriction 
j  predicate  P2 

I  I 

I  r2  I 

I  I 
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cannot  Q2  must  then  be  resolved  by  accessing  individual  records.  We  shall  call  Q1  the  index- processible 
predicate  and  Q2  the  residual  predicate . 


Some  algorithms  of  interest  for  processing  joins  arc  summarized  briefly  hereunder  (see  also  [BLA  76)  [SEL 
79]): 

•  Join  Index  Method :  This  method  presupposes  the  existence  of  join  indexes.  For  each  relation,  the 
TIDs  of  tuples  that  satisfy  the  index  processible  predicates  we.  obtained  by  manipulating  the  TIDs 
from  each  index  involved;  the  resultant  TIDs  are  stored  in  temporary  relations  and  R 2\  TID 
pairs  with  the  same  join  column  values  are  found  by  scanning  the  join  column  indexes  according 
to  the  order  of  the  join  column  values.  As  they  are  found,  each  TID  pair  (TIDr  TID2)  is  checked 
to  determine  whether  TIDj  is  present  in  and  TID2  in  R2\  If  they  arc,  the  corresponding  tuple 
in  one  relation,  say  Rr  is  retrieved:  When  this  tuple  satisfies  the  residual  predicate  for  Rp  the 
corresponding  tuple  in  the  other  relation  R^  is  retrieved  and  the  residual  predicate  for  R2  is 
checked.  If  qualified,  the  tuples  are  concatenated  and  the  subtuple  of  interest  is  constructed. 

•  Sort-Merge  Method :  The  relations  Rl  and  R2  are  scanned- either  by  using  restriction  indexes,  if 
there  is  an  index-processible  predicate  in  the  query,  or  by  scanning  the  relation  directly -and 
temporary  relations  T]  and  T2  arc  created.  Restrictions,  partial  projections,  and  the  initial  step  of 
sorting  are  performed  while  die  relations  are  being  initially  scanned  and  stored  in  T}  and  T2.  T, 
and  T2  are  sorted  by  the  join  column  values.  The  resulting  relations  arc  scanned  in  parallel  and 
the  join  is  completed  by  merging  matching  tuples. 

•  Combination  of  the  Join  Index  Method  and  the  Sort-Merge  Method :  One  relation,  say  Rl,  is 
sorted  as  in  the  sort-merge  method  and  stored  in  Tj.  Relation  R2  is  processed  as  in  the  join  index 
method,  storing  the  TIDs  of  the  tuples  that  satisfy  the  index  processible  predicates  in  R2\  Tj  and 
the  join  column  index  of  R2  are  scanned  according  to  the  join  column  values.  As  matching  join 
column  values  arc  found,  each  TID  from  the  join  index  of  R2  is  checked  against  R2 .  If  it  is  in  R2 , 
the  corresponding  tuple  in  R2  is  retrieved  and  the  residual  predicate  for  R2  is  checked.  If 
qualified,  the  tuples  are  concatenated  and  the  subtuplc  is  constructed.3 

•  Inner/Outer- Loop  Join  Method.  In  the  two  join  methods  described  above,  the  join  is  performed 
by  scanning  relations  in  the  order  of  the  join  column  values.  In  the  inner/outer-loop  join,  one  of 
the  relations,  say  Rr  is  scanned  without  regard  to  order,  cither  by  using  restriction  indexes  or  by 
scanning  the  relation  directly,  and,  for  each  tuple  of  R}  that  satisfies  predicate  Pj,  the  tuples  of 
relation  R2  that  satisfy  predicate  P2  and  the  join  predicate  are  retrieved  and  concatenated  with  the 
tuple  of  Rr  The  subtuplcs  of  interest  are  then  projected  upon  the  result4 


In  actual  implementation,  the  combinations  of  join  methods  can  be  cither  coded  separately  or  programmed  to  be  dynamically 
synthesized  at  query  processing  time  A  specific  combination  of  join  methods  will  be  selected  or  synthesized  according  to  the  result  of 
the  query  optimization  which,  given  a  fixed  structure  of  the  physical  database,  will  find  the  best  evaluation  method  for  a  query. 

4 

One  of  the  advantages  of  this  join  is  that  it  docs  not  require  scanning  a  relation  in  a  sorted  order.  Furthermore,  this  method  is  often 
better  than  the  join  index  method  if  the  number  of  qualified  tuples  retrieved  from  R2  is  small,  making  it  unnecessary  to  scan  the  entire 
join  index  for  R2.  On  the  other  hand,  if  a  large  portion  of  relation  R2  satisfies  the  predicates,  this  method  will  cause  repeated  accesses  of 
the  index  tree  -  which  will  be  more  costly  than  a  single  scan  of  the  index. 
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•  Multiple- Pass  Method.  One  of  the  relations  participating  in  the  join,  say  Rr  is  scanned,  the  tuples 
are  obtained,  restricted,  projected,  and  inserted  into  a  data  structure  Tj,  whose  size  is  constrained 
to  fit  in  the  available  main  store.  If  space  in  main  store  is  available  to  insert  the  resulting  subtuple* 
r,  this  is  done.  If  space  is  not  available,  but  the  join  column  value  in  r  is  less  than  the  current 
highest  join  column  value  in  Tj,  the  subtuplcs  with  the  highest  join  column  value  in  Tx  are  then 
deleted  and  r  is  inserted.  Otherwise  r  is  not  inserted  at  all.  After  T1  has  been  formed,  R2  is 
scanned  by  using  an  appropriate  access  path,  and  every  tuple  of  R2  that  satisfies  the  predicate  is 
concatenated  (if  possible)  with  the  appropriate  subtuples  in  Tl  and  the  result  projected.  If  there 
are  more  qualified  tuples  in  Rj  than  can  fit  in  the  main  store  for  another  scan  of  Rj  is  done  to 
form  a  new  Tj  consisting  of  subtuples  with  join  column  values  greater  than  the  current  highest. 
R2  is  also  scanned  again  and  the  whole  process  repeated.  This  method  is  very  fast  if  only  one  pass 
is  needed.  But  processing  time  increases  rapidly  when  more  passes  are  performed. 

•  Link- Based  Join  Method  This  is  conceptually  similar  to  the  inner/outer- loop  join  method,  but  it 
takes  advantage  of  existing  links  [BLA  76]  between  the  two  relations.  The  use  of  links  will  be 
mentioned  briefly  as  an  extension  of  our  basic  methodology. 


Let  us  note  that,  in  the  combination  of  the  join  index  method  and  the  sort-merge  method,  the  operation 
performed  on  either  relation  is  identical  to  that  performed  on  one  relation  -  whether  in  the  join  index  method 
or  the  sort-merge  method.  We  call  the  operations  performed  on  each  relation  join  index  method  (partial)  or 
sort-merge  method  (partial),  respectively;  whenever  no  confusion  arises,  we  call  these  operations  simply  join 
index  method  or  sort  merge  method.  According  to  these  definitions,  the  join  index  method  actually  consists  of 
two  join  index  methods  (partial)  and  similarly  the  sort-merge  method  consists  of  two  sort-merge  methods 
(partial). 

1.4  Cost  Model  of  the  Storage  Structure 

To  calculate  the  cost  of  evaluating  a  query,  we  need  a  proper  model  of  the  underlying  storage  structure  and 
its  corresponding  cost  formula.  Although  the  theory  does  not  depend  on  the  specifics  of  cost  models,  it  is 
helpful  to  have  a  simple  cost  model  for  illustrative  purposes. 

We  assume  that  a  B-trcc  index  (BAY  72]  can  be  defined  for  a  column  or  for  a  set  of  columns  of  a  relation. 
The  leaf-level  of  the  index  consists  of  pairs  (key  and  TID)  for  every  tuple  in  that  relation.  The  leaf-level 
blocks  are  chained  according  to  the  order  of  indexed  column  values,  so  that  the  index  can  be  scanned  without 
traversing  the  index  tree.  Entries  having  the  same  key  value  arc  ordered  by  TID. 
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An  index  is  called  a  clustering  index  if  the  relation  for  which  this  index  is  defined  is  physically  clustered 
according  to  the  index  column  values.  With  a  clustering  index,  we  assume  that  no  block  is  fetched  more  than 
once  when  tuples  with  consecutive  values  of  the  indexed  column  are  retrieved.  Except  for  this  ordering 
property,  no  other  difference  in  the  structure  is  assumed  between  a  clustering  and  a  nonclustering  index.  The 
clustering  property  can  greatly  reduce  the  access  cost,  especially  when  a  join  column  has  a  clustering  index. 
Unfortunately,  only  one  column  of  a  relation  can  have  the  clustering  property,  since  clustering  requires  a 
specific  order  of  records  in  the  physical  file.  One  of  the  objectives  of  designing  optimal  physical  databases  is 
to  determine  which  column  will  be  assigned  the  clustering  property. 

The  access  cost  will  be  measured  in  terms  of  the  number  of  I/O  accesses.  The  following  notation  will  be 

used  throughout  this  paper: 

nR  :  Number  of  tuples  in  relation  R  (cardinality) 

pR  :  Blocking  factor  of  a  block  containing  tuples  of  relation  R. 

Lj  :  Blocking  factor  of  an  index  block  containing  index  I. 

Fc  :  Selectivity  of  the  column  used  or  the  index  thereof. 
mR  :  Number  of  blocks  in  relation  R,  which  is  equal  to  nR/pR. 

By  using  the  simplified  model  above,  the  cost  of  various  operations  can  be  obtained  as  follows: 

•  Relation  Scan  Cost  -  Cost  for  serially  accessing  all  the  blocks  containing  the  tuples  of  a  relation: 

RS(R)  =  nR/pR  -  mR 

•  Index  Scan  Cost  -  Cost  for  serially  accessing  the  leaf-  level  blocks  of  an  entire  index: 

IS(I,R)  =  nR/Lj 

•  Index  Access  Cost  -  Cost  for  one  access  of  the  index  tree  from  the  root: 

IA(I,R)  =  log^  (nR/Lj)  +  Fj  X  nR/Lj 

•  Sorting  Cost  -  Cost  for  sorting  a  relation,  or  a  part  thereof,  according  to  the  values  of  the  columns 
of  interest: 

SORT(NB)  =  2  X  NB  X  logz  NB 

Here  we  assume  that  a  z-way  sort-merge  is  used  for  the  external  sort  [KNU-b  73].  NB  is  the 
number  of  blocks  in  the  temporary  relation  containing  the  subtuples  to  be  sorted  after  restriction 
and  projection  have  been  resolved.  It  will  be  noted  that  SORT(NB)  docs  not  include  the  initial 
scanning  time  to  bring  in  the  original  relation,  while  it  docs  include  the  time  to  scan  the  temporary 
relation  for  the  actual  join  after  sorting  (sec  [BLA  76]). 
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1.5  Design  Theory 

In  this  section  we  develop  a  theory  for  the  design  of  optimal  physical  databases.  We  shall  seek  to  facilitate 
comprehension  through  a  scries  of  examples  and  by  case  analysis,  using  the  cost  model  developed  in  Section 
1.4.  Observations  resulting  from  this  procedure  are  formalized  and  proved  in  Section  1.5.4. 

Our  approach  to  physical  database  design  is  based  on  the  premise  that  at  execution  time  the  query 
processor  will  choose  the  best  processing  method  for  a  given  query.  We  call  this  processor  an  optimizer. 
Since  the  behavior  of  the  optimizer  at  execution  time  affects  the  physical  database  design  critically,  we 
investigate  this  issue  and  discuss  how  it  is  related  to  the  design. 

Since  the  set  of  join  methods  consisting  of  the  join  index  method,  the  sort-merge  method,  and  the 
combination  of  the  two  possesses  the  special  property,  called  separability  which  we  shall  define  later,  wc 
regard  only  those  methods  as  being  available  for  the  design  theory  (the  inner/outer-loop  join  method,  the 
multiple-pass  method,  and  the  link-based  method  are  nonseparablc  join  methods  with  respect  to  this 
separable  set). 

Wc  define  the  influence  of  the  restriction  on  one  relation  to  the  number  of  tuples  to  be  retrieved  in  the 
other  relation  the  coupling  effect  (which  is  similar  in  concept  to  the  feedback  mentioned  in  [YAO  79]). 
Starting  with  a  case  in  which  coupling  effects  between  relations  arc  not  considered,  wc  then  proceed  to  those 
cases  in  which  they  are  included. 

1.5.1  Cases  without  coupling  effects 

Example  l  :  Figure  1-2  describes  two  relations  R1  and  R2  with  their  access  configurations.  Dashed  lines  (/) 
represent  clustering  indexes,  the  dotted  lines  (:)  nonclustering  indexes.  Columns  without  either  type  of  line 
have  no  indexes  defined  for  them.  Wc  would  like  to  find  the  best  method  of  evaluation  — which  the  optimizer 
would  choose  at  query-processing  time,  for  the  following  query: 
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SELECT  Ar  Ar  B2 
FROM  Rr  R2 
WHERE  Rj.A2  =  ’a2*  AND 
R2.B2  =  ’b2’  AND 


B,  B, 


I  /  :  I  I  / 

|  /  :  |  JOIN  |  / 

I  /  :  I  I  / 


R 


2 


Figure  1-2:  Relations  R  L  and  R2. 


For  this  example  only,  it  is  also  assumed  that  all  the  tuples  in  each  relation  participate  in  the  join. 


Given  these  assumptions,  the  optimizer  could  try 
evaluate  the  cost  of  each,  and  then  select  the  one 
combinations: 

Rl 

1.  Join  index  method  (partial) 

2.  Sort-mergc  method  (partial) 

3.  Join  index  method  (partial) 

4.  Sort-mergc  method  (partial) 


all  the  possible  combinations  of  the  join  methods, 
that  costs  the  least  We  have  here  the  following 


Join  index  method  (partial) 
Sort-merge  method  (partial) 
Sort-mergc  method  (partial) 
Join  index  method  (partial) 


Using  the  cost  model  developed  in  Section  1.4,  the  following  formulas  give  the  cost  (number  of  block 
accesses)  for  each  of  the  four  cases  above.  In  each  formula  the  first  and  second  bracketed  expressions 
represent  the  cost  of  accessing  relation  Rp  and  R2  respectively.  Bracketed  expressions  in  the  formulas  are 
given  arbitrary  values  for  illustrative  purposes.  Those  expressions  whose  form  is  identical  arc  given  the  same 
value. 

Cost  =  [IA(1a2,  Rj)  +  IS(IAr  R:)  +  Fa2  X  nR1]  +  :  100  +  (1.1) 

[IA(IB2i  R2)  +  ISOr!*  +  b<mR2’  Pr2’  ^B2  *  nR2^  *  ^ 


I 
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Cost  =  [IA(Ia2,  Rj)  +  Fa2  X  mR1  +  SORT(FA2  X  HR,  X  mR,)l  + 

[1A(IB2.  R2)  +  b(mR2,  pR2.  FU2  X  nR2)  +  SORIXF^  X  HR2  X  mR2)l 

:60  + 

:  50 

(12) 

Cost  =  [IA(Ia2,  R,)  +IS(Ia1,  Rj)  +  Fa2  X  nR,]  + 

[IA(I|}2,  R2)  +  b(inR2,  pR2,  Fb2  X  nR2)  +  SORT(F'B2  X  HR2  X  mR2)J 

:  100  + 

:  50 

(1.3) 

Cost  =  [I A(Ia2.  Rj)  +  h'A2  X  mR1  +  SORT(F'A2  X  HRJ  X  mR1))  + 

[IA(IB2.  ^2)  +  ^2^  +  b<mR2’  PR2’  ^B2  *  nR2^ 

!  60  + 

:  20 

(1.4) 

Here  b(m,p,k)  is  a  function  that  provides  the  number  of  block  accesses,  where  k  is  the  number  of  tuples  to  be 
retrieved  in  TID  order.  An  exact  form  of  this  function  and  various  approximation  formulas  arc  summarized 
in  Chapter  2.  The  function  is  approximately  linear  in  k  when  k  «  n,  and  approaches  m  as  k  become*  large.  A 
familiar  approximation  suggested  by  Cardenas  [CAR  75]  is  b(m,p,k)  =  m  [1  —  (1  —  l/p)k].  FA2  and  FB2  are 
the  selectivities  of  the  columns  RrA2  and  R2.B2,  respectively.  !n  Equation  (1.1),  FA2  X  nR1  and 
b(mR2,pR2,FB2  X  nR2)  represent  the  numbers  of  blocks  accessed  that  contain  data  tuples  of  relation  and 
R2,  respectively.  Since  retrieving  tuples  by  scanning  a  nonclustering  join  index  will  access  the  tuples 
randomly,  the  same  block  will  be  accessed  repeatedly  if  it  contains  more  than  one  tuple.  Therefore  it  is  very 
likely  that  one  block  access  is  needed  to  retrieve  each  tuple.  Hence  we  get  FA2  X  nR1  for  the  number  of  data 
blocks  fetched  from  relation  Rr  Note  that  in  this  case  the  tuples  cannot  be  accessed  in  TID  order.  For 
relation  R2,  however,  the  join  index  is  clustering  and  thus  the  tuples  will  be  retrieved  in  TID  order,  even 
though  they  are  selected  randomly  by  the  restriction.  Therefore,  even  though  a  block  contains  more  than  one 
tuple,  in  all  likelihood  each  block  will  be  fetched  only  once.  We  thus  get  b(mR2’PR2'^B2  X  nR2^  ^or  *c 
number  of  data  blocks  fetched  from  R2,  where  Fg2  X  nR2  is  the  number  of  tuples  selected  by  the  restriction. 

In  Equation  (1.2),  FA2  X  mR1  and  b(mR2,pR2,FB2  *  nR2)  rcPrescnt  numbers  of  blocks  accessed  during 
the  initial  scan  of  the  relation  prior  to  sorting.  Since  the  restriction  index  is  clustering  in  relation  Rr  the 
initial  scan  through  this  restriction  index  will  access  FA2  X  mR1  blocks.  In  relation  R2,  a  nonclustering 
restriction  index  is  used  to  access  the  relation  initially.  This  restriction  results  in  random  distribution  of  TIDs 
of  the  qualified  tuples  over  the  blocks.  Since  these  tuples  are  then  accessed  in  TID  order,  the  access  cost  is 
b(mR2,pR2>FB2  X  nR2). 
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The  factor  HR2  used  in  the  Equation  (1.3)  represents  the  projection  effect  upon  relation  R2.  Since  the 
projection  selects  only  part  of  the  attributes  from  the  relations,  the  tuple  is  usually  smaller  after  projection. 
The  time  required  to  write  the  final  result  is  not  included,  since  it  is  the  same  regardless  of  the  join  method 
used. 

With  the  specific  values  of  the  access  cost  given.  Equation  (1.4)  gives  the  minimum  access  cost.  We  note 
that  the  access  costs  for  each  relation  do  not  depend  on  any  parameter  of  any  other  relation,  and  that  each 
part  of  the  cost  of  Equation  (1.4)  becomes  the  local  minimum.  That  is,  the  first  part  of  the  cost  incurred  by 
accessing  relation  R1  is  the  minimum  of  tfte  costs  of  the  join  methods  used  for  Rr  while  the  second  part  is  the 
minimum  of  those  for  R2.  This  implies  that  the  optimizer  can  determine  the  optimal  join  method  on  one 
relation  without  regard  to  any  properties  of  other  relations.  □ 

The  foregoing  observation  is  extremely  important  because,  if  we  can  determine  the  optimal  join  method  for 
one  relation  without  regard  to  other  relations,  we  can  also  use  the  following  method  to  determine  the  optimal 
access  configuration  for  the  relation  without  regard  to  other  relations: 

1.  try  every  possible  access  configuration  for  a  relation  in  turn. 

2.  for  a  given  access  configuration,  find  the  best  evaluation  method -which  the  optimizer  would 
choose  at  query-processing  time  — for  each  given  query  (this  corresponds  to  the  query 
optimization  problem). 

3.  then  calculate  the  total  cost  for  processing  the  queries,  using  their  expected  frequency  of 
occurrence. 

4.  repeat  this  procedure  for  all  other  possible  access  configurations,  finally  selecting  the  one  that 
yields  the  minimal  total  cost 

The  result  of  this  will  be  to  reduce  designing  an  optimal  access  configuration  of  a  database  to  that  of  a 
single  relation.  Local  optimal  solutions  for  individual  relations  constitute  an  optimal  solution  for  the  entire 
database.  However,  the  foregoing  procedure  of  making  an  exhaustive  search  of  all  the  possible  access 
configurations  could  yet  prove  too  costly,  fherefore,  in  Section  1.6  we  divide  the  design  procedure  into  two 
parts:  choice  of  the  clustering  column  and  index  selection.  We  shall  provide  a  clean  interface  between  the 
two  steps  and  discuss  deviations  from  the  true  optimum. 
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It  should  be  pointed  out  here  that,  despite  our  assumption  that  there  is  no  coupling  effect  between  the  two 
relations  and  despite  the  fact  that  the  above  argument  appears  to  follow  directly  from  that  assumption,  it  will 
be  shown,  in  the  following  discussion,  that  the  problem  is  similarly  reduced  even  when  coupling  effects  are 
actually  present  Before  further  discussion,  we  need  the  following  definition  and  example. 

Definition  1:  The  join  selectivity  J(R,JP)  of  a  relation  R  with  respect  to  a  join  path  JP  is  the  ratio  of  the 
number  of  distinct  join  column  values  of  the  tuples  participating  in  the  unconditional  join  to  the  total  number 
of  the  distinct  join  column  values  of  R.  A  join  path  is  a  set  (Rj.Rj.A.Rj.Rj.B),  where  Rx  and  R2  are  relations 
participating  in  the  join  and  Rj.A  and  R2.B  are  the  join  columns  of  R2  and  R2,  respectively.  An  unconditional 
join  is  a  join  in  which  the  restrictions  on  either  relation  are  not  considered.  □ 

Join  selectivity  is  the  same  as  the  ratio  of  the  number  of  tuples  participating  in  the  unconditional  join  to  the 
total  number  of  tuples  in  the  relation  (cardinality  of  the  relation).  Join  selectivity  is  generally  different  in  Rj 
and  R2  with  respect  to  a  join  path,  as  shown  in  the  following  example: 

Example  2:  Let  us  assume  that  the  two  relations  in  Figure  1-3  have  a  1-to-N  partial-dependency 

relationship.  Partial  dependency  means  that  every  tuple  in  the  relation  R2  that  is  on  the  N-side  of  the 

relationship  has  a  corresponding  tuple  in  R.,  but  not  vice  versa  [F.LM  80].  Let  us  assume  that  50%  of  the 

employees  have  at  least  one  child  each  so  that  the  tuples  representing  those  employees  participate  in  the 

unconditional  join.  Every  tuple  in  the  children  relation  R2  is  assumed  to  have  only  one  corresponding  tuple 

in  Rj  and  all  of  them  participate  in  the  unconditional  join  according  to  the  partial  dependency.  The  join 

selectivity  of  the  employees  relation  is  then  0.5,  while  that  of  the  children  relation  is  1.0  □ 

R  p  Employ ees(E#,  Job,  Age,  Salary) 

R2:  Childrcn(E#,  Name,  Hair-color,  Sex) 

Figure  1-3:  Employees  and  Children  relations. 
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1.5.2  Cases  with  coupling  effects 

Let  us  investigate  the  four  cases  shown  in  Example  1  — using  the  same  query,  join  methods,  and  access 
configuration  defined  as  in  Figure  1-2,  but  now  with  coupling  effects.  In  fact,  we  shall  consider  coupling 
effects  throughout  our  subsequent  discussions.  We  shall  also  assume  that  R ^  and  R2  have  a  l*to-N 
relationship  (1  for  Rx  and  N  for  R2). 

Case  t:  The  join  index  method  is  applied  to  both  relations  Rx  and  Rr  With  coupling  effect,  the  join  will 
be  performed  as  follows:  If  a  tuple  of  relation  R1  does  not  satisfy  the  restriction  predicate  for  Rv  the 
corresponding  tuples  of  R2  that  have  the  same  join  column  values  are  not  accessed.  Hence,  we  have  the 
coupling  effect  from  R ^  to  R2.  If  there  are  only  index-proccssible  predicates  in  the  query  to  be  evaluated,  the 
situation  is  then  symmetric -in  the  sense  that,  for  the  tuples  in  relation  R2  that  do  not  satisfy  the  restriction 
predicate  for  R2,  the  corresponding  tuples  of  Rx  are  not  accessed  either.  We  have  this  symmetry  because  we 
can  resolve  all  index-processible  predicates  by  using  TIDs  only,  without  any  need  to  access  the  data  tuples 
themselves. 

Since  both  RrA2  and  R2.B2  have  indexes  defined  for  them,  the  restriction  predicates  in  the  WHERE  clause 
are  index-processiblc.  Therefore,  the  cost  of  evaluating  this  query,  including  the  coupling  effect,  will  be  as 
follows: 

Cost  =  IIAU^Rj)  +  IS(IA1,R1)  +  {<JX  X  b(l/FB1,FBl  X  nR2, 

^B2  *  *  ^A2  *  nRl^  + 

[IA(IB2»R2)  +  IS(IB1,R2)  +  b(mR2,pR2,{<J2  *  ^A2>  X  Fb2  X  n^})] 

Here  Jl  and  J2  represent  the  join  selectivity  of  relations  Rx  and  R2,  respectively,  for  the  join  path  considered. 
Expressions  in  the  braces  represent  the  numbers  of  data  tuples  accessed  in  relations  R ^  and  R2>  respectively. 
In  the  first  part  of  the  formula,  the  expression  in  the  braces  simultaneously  represents  the  number  of  blocks 
accessed  in  relation  Rr  This  follows  die  argument  shown  in  Example  1. 

Ffil  is  the  selectivity  of  column  R2.BX  and  1/FR1  represents  the  number  of  groups5  of  tuples  that  have  the 
same  join  column  values  in  relation  R2~  which  is  essentially  the  same  as  the  number  of  distinct  join  column 
values. 

5  Group  here  is  very  close  in  concept  to  set  occurrence  in  CODASYL-type  databases. 
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The  expression  b(l/FB1,FB1  *X  nR2,  FR2  X  nR2)  represents  the  number  of  groups  selected  by  restriction 
Although  the  b  function  estimates  the  number  of  block  accesses  in  which  a  certain  number  of  tuples  arc 
randomly  selected,  the  same  function  is  used  for  estimating  the  number  of  logical  groups  selected -if  the 
latter  are  assumed  to  be  of  uniform  size.  Note  that  the  clustering  or  nonclustering  of  tuples  in  a  group  is 
irrelevant.  Ffil  X  nR2,  the  number  of  tuples  in  one  logical  group,  plays  a  role  similar  to  that  of  the  blocking 
factor. 

The  expression  b(l/FB1,  Fm  X  nR2,  Ffi2  X  nR2)Al/FB1)  rePresents  rah°  °f  number  of  groups 
selected  by  restriction  FR2  to  the  total  number  of  groups  in  relation  R2.  Since  every  tuple  participating  in  the 
unconditional  join  in  R{  has  a  unique  join  column  value  and,  accordingly,  exactly  one  corresponding  group  in 
R2  (let  us  recall  that  Rj  is  on  the  1-side  of  the  1-to-N  relationship),  this  ratio  correctly  represents  a  special 
restriction  upon  R{  caused  by  the  coupling  effect  originating  in  R2.6 

In  the  second  part  of  the  cost  formula,  we  simply  use  FA2  to  represent  the  coupling  effect  directed  from  Rx 
to  R2.  Since  in  R{  every  tuple  has  a  unique  join  column  value,  if  a  tuple  is  selected  according  to  the 
restriction,  the  corresponding  group  in  R2  that  has  the  same  join  column  value  (if  it  exists)  will  be  selected  on 
the  basis  of  this  special  restriction  resulting  from  the  coupling  effect  Hence,  FA2  represents  the  ratio  of  the 
number  of  groups  selected  as  a  consequence  of  the  coupling  effect  to  the  total  number  of  groups  in  R2 
participating  in  the  unconditional  join.  ITiat  ratio,  in  turn,  has  the  same  value  as  the  ratio  of  tuples,  selected 
according  to  the  coupling  effect  to  the  total  number  of  tuples  participating  in  the  unconditional  join  in  Rr  □ 

The  coupling  effect  is  formally  defined  as  follows: 

Definition  2:  The  coupling  effect  from  relation  R  j  to  relation  R2,  with  respect  to  a  type  of  query,  is  the  ratio 
of  the  number  of  distinct  join  column  values  of  the  records  of  Rr  selected  according  to  the  restriction 
predicate  for  R^  to  the  total  number  of  distinct  join  column  values  in  Rr  □ 


Note  that  this  ratio  could  be  very  different  from  and  is  always  larger  than  FB2>  cxpccially  when  a  group  is  large.  The  reason  is  that,  if 
at  least  one  tuple  in  a  group  is  selected,  the  corresponding  join  column  value  and  the  corresponding  tuple  in  arc  selected  according  to 
this  special  restriction  resulting  from  the  coupling  effect 
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If  we  assume  that  the  join  column  values  are  randomly  selected,  the  coupling  effect  from  R}  to  R2  is  the 
same  as  the  ratio  of  the  number  of  distinct  join  column  values  of  R2  selected  by  the  effect  of  the  restriction 
predicate  for  Rj  to  the  number  of  distinct  join  column  values  in  R2  participating  in  the  unconditional  join. 

Definition  3:  A  coupling  factor  Cf12  from  relation  Rx  to  relation  R2,  with  respect  to  a  type  of  query,  is  the 
ratio  of  the  number  of  distinct  join  column  values  of  R2,  selected  by  both  the  coupling  effect  from  Rx 
(through  the  restriction  predicate  for  Rx)  and  the  join  selectivity  of  R2,  to  the  total  number  of  distinct  join 
column  values  in  R2.  □ 

According  to  the  definition,  a  coupling  factor  can  be  obtained  by  multiplying  the  coupling  effect  from  R ^ 
to  R2  by  the  join  selectivity  of  R2.  This  coupling  factor  contains  all  the  consequences  of  the  interactions  of 
relations  in  the  join  operation,  since  it  includes  both  coupling  and  joining  filtering  effects.  Let  us  note  that, 
although  the  coupling  factor  can  be  obtained  in  any  case,  it  does  not  always  contribute  to  the  reduction  of  the 
tuples  to  be  retrieved.  We  will  see  an  example  of  this  in  Case  2  below.  A  coupling  factor  is  said  to  be  effective 
if  the  coupling  effect  actually  contributes  to  the  reduction  of  the  tuples  to  be  retrieved.  In  Case  1,  the 
expressions  in  angle  brackets  represent  the  coupling  factors  from  R2  to  Rx  and  from  Rx  to  R2,  respectively,  for 
the  type  of  query  considered.  By  definition,  different  queries  are  of  the  same  type  if  they  are  identical  except 
for  their  literal  values.  The  same  applies  to  update  transactions.  For  example,  INSERT  INTO  Rx  <a,b>  is  of 

the  same  type  as  INSERT  INTO  Rj  <c,d>).  Hence, 

Cf12  =  J2XFA2, 

®2l  =  -*i  x  b(l/FBi*  F01  X  nR2,  FB2  X  nR2)/(l/FB1). 

One  important  observation  here  is  that  the  coupling  factors  do  not  depend  on  the  specific  access  structures 
present  in  cither  relation,  nor  on  the  specific  join  method  selected,  but  rather  (and  solely)  depend  on  the 
restriction  and  the  data  characteristics.  Such  characteristics  include  the  side  the  relation  is  on  in  the  1-to-N 
relationship,  the  average  number  of  tuples  in  one  group,  and  the  join  selectivity  -  which  will  be  known  before 
we  start  the  design  phase. 

Note  that  the  coupling  factors  differ  according  to  the  specific  type  of  query  being  considered.  Different 
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types  of  queries  have  different  join  paths  and  different  combinations  of  columns  in  the  restriction 
predicate- with  consequently  different  selcctivities  for  the  calculation  of  coupling  factors. 

Now  let  us  investigate  the  remaining  cases  in  which  coupling  effects  are  present  between  relations. 

Case  2:  The  sort-merge  join  method  is  applied  to  both  relations,  in  the  same  situation  as  in  Figure  1-2.  The 

cost  formula  is  then  as  follows: 

Cost  =  [Fa2  X  mR1  +  SORT(FA2  X  HR1  X  mR1)J 

+  [IA(1B2,R2)  +  b(mR2,pR2,FB2  X  nR2)  +  SORT(FR2  X  HR2  X  mR2)J 

It  will  be  noted  that  the  coupling  factors  do  not  appear  in  the  cost  formula.  This  is  because,  when  the  sort- 
merge  join  method  is  used,  an  initial  scan  and  the  sort  arc  performed  before  the  join  is  resolved;  indexes  are 
not  used  any  more  while  the  join  is  being  actually  resolved,  since  the  relation  scan  is  performed  upon  the 
sorted  temporary  relations.  The  coupling  effect  can  arise  only  when  the  join  is  being  actually  resolved  and 
only  when  the  join  index  is  used.  Thus,  the  coupling  factor  is  not  effective  in  this  case. 

Case  3:  The  sort-merge  join  method  is  used  for  Rx,  the  join  index  method  for  R2~  in  the  same  situation  as 

in  Figure  1-2.  The  join  will  be  performed  as  described  in  Section  1.3,  under  the  heading:  Combination  of  the 

Join  Index  Method  and  Sort-Merge  Method.  Note  that  the  coupling  factor  is  effective  from  Rj  to  R2.  Thus, 

we  obtain  the  following  cost  formula: 

Cost  =  [Fa2  X  mR1  +  SORT(FA2  X  Hr  X  mR1)J 

+  [1A(1B2,R2)  +  1S(1B1,R2)  +  b(mR2,pR2,Cf12  X  FB2  X  nR2)J 

Case  4:  The  join  index  method  is  used  on  R,,  the  sort-merge  method  on  R2-in  the  same  situation  as  in 

Figure  1-2.  We  obtain  the  following  cost  formula: 

Cost  =  (IAd^Rj)  +  iSdA1.Rj)  +  Cf21  X  FA1  X  nR1l 

+  [1A(1B2,R2)  +  b(mR2,pR2,FB2  X  nR2)  +  SORf(FB2  X  HR2  X  mR2)] 


-  18  - 


SEPARABILITY- AN  APPROACH  TO  PHYSICAL  DATABASE  DESIGN 


1 .5.3  Cates  when  restriction  indexes  are  absent  on  one  relation 

All  four  cases  that  have  been  discussed  so  far  assume  the  same  situation  as  in  Example  1- except  for 
inclusion  of  the  coupling  cffecL  We  still  have  to  consider  more  general  cases  in  which  restriction  indexes  are 
absent  for  the  columns  specified  in  the  predicate  of  the  query  for  one  relation.  The  case  in  which  the 
restriction  indexes  are  absent  in  both  relations  will  be  treated  in  Section  1.5.4.  For  clarity  of  presentation,  let 
us  define  a  shorthand  notation  for  the  cost  formula. 

Definition  4:  Cost(Rk,  Cljk,  type-of-join)  is  the  cost  of  a  join  operation  associated  with  relation  Rk  when  Rk 
has  a  coupling  factor  Cfjk  from  R^  to  Rk,  with  respect  to  the  query  of  interest,  and  the  type-of-  join  is  the  join 
method  used  between  Rk  and  Rj.  □ 

Although  costs  difFer  for  different  access  configurations,  this  shorthand  notation  for  the  cost  function  does 
not  show  that  difference  explicitly,  because  it  is  irrelevant  to  our  subsequent  discussions.  Using  this 


definition,  cost  formulas  for  the  previous  cases  can  be  restated  as 

Case  1:  Cost(RrCf21, Join-index)  +  Cost(R2,Cf12,  Join-index)  (1.5) 

Case  2:  CostfRj.Cfjj.Sort-merge)  +  Cost(R2,Cf12, Sort-merge)  (1.6) 

Case  3:  Cost(Rj,Cf21, Sort-merge)  +  Cost(R2,Cf12, Join-index)  (1.7) 

Case4:  Cost(R1,Cf21,Join-index)  +  Cost(R2,Cf12,Sort-merge)  (1.8) 


If  there  is  no  coupling  effect  between  the  two  relations,  as  in  the  case  of  a  query  that  does  not  impose  a 
restriction  on  a  relation,  say  R2,  then  the  coupling  factor  Cf21  simply  becomes  the  join  selectivity,  Jk  —  if  the 
join  index  method  is  used  for  Rr  The  cost,  in  this  case,  will  be  Cost(R1J1, type-of-join).  When  the  sort- 
merge  join  method  is  used  for  relation  Rk  the  cost  becomes  Cost(Rk,l,sort-merge).  But  it  is  identical  to 
Cost(Rk,Cfjk,sort-merge),  because,  as  wc  observed  in  Case  2,  the  coupling  factor  is  not  used  in  the  cost 
formula.  According  to  the  same  argument,  wc  conclude  that  the  cost  of  the  sort-merge  join  method  can 
always  be  written  as  Cost(Rk, Cf.k, sort-merge). 

Case  1-A:  Let  us  assume  that  the  join  index  method  is  used  for  both  Rj  and  R2,  in  the  same  situation  as  in 
Figure  1-2,  except  that  the  restriction  index  for  column  R^.A2  is  missing.  The  «in  will  be  performed  as 
follows.  First  the  TID  set  R2’  of  the  tuples  that  satisfy  the  restriction  on  R2  is  obtained  by  using  the  restriction 
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on  column  R2.B2.  T1D  pairs  lhat  have  the  same  join  column  values  are  found  by  scanning  the  join  column 
indexes  according  to  the  order  of  join  column  values.  As  it  is  found,  each  T1D  pair  (TID^TIDj)  is  checked  to 
see  if  TID2  is  present  in  R2’.  If  it  is,  the  corresponding  tuple  in  relation  Rj  is  retrieved.  If  this  tuple  satisfies 
the  restriction  upon  Rr  the  corresponding  tuple  in  R2  is  also  retrieved  and  concatenated,  and  the  result 
projected.  Note  that  the  coupling  factors  are  effective  in  both  directions.  Thus,  the  cost  of  evaluating  the 
query  will  be 

Cost  =  [IS(IA1.R1)  +  Cf21  X  nR1] 

+  [IA(IB2,R2)  +  1S(IB1,R2)  +  b(mR2,pR2,Cf12  X  FB2  X  nR2)J 
=  Cost(R1,Cf21  join-index)  +  Cost(R2,Cf12  join-index). 

Note  thaL  since  the  restriction  index  on  column  Rj.A2  is  missing,  the  first  part  of  the  cost  formula  is 
different  from  that  of  Case  1,  but  the  coupling  factors  remain  the  same.  The  case  in  which  R2.B2  is  absent 
instead  of  RrA2  is  treated  similarly  and  will  result  in  the  same  formula  in  the  shorthand  notation. 

Case  2-A:  The  sort-merge  method  is  used  for  both  Rj  and  R2  in  the  same  situation  as  in  Figure  1-2,  except 

that  the  restriction  index  on  the  column  Rj.A2  is  missing.  The  cost  formula  becomes 
Cost  =  [mR]  +  SORT(Fa2  X  Hr1  X  mR1)] 

+  [!A(IB2,R2)  +  b(mR2,  pR2,  Fb2  X  nR2)  +  SORT(FR2  X  HR2  X  mR2)] 

=  Cost(RrCf21, sort-merge)  +  Cost(R2,Cf12,sort-merge) 

The  case  in  which  the  index  on  R2-B2  is  missing  (rather  than  Rj.A2)  is  treated  similarly  and  will  result  in  the 

same  formula  in  the  shorthand  notation. 

Cases  3-A  and  4-A:  The  sort-merge  method  is  used  for  and  the  join  index  method  for  R2,  in  the  same 
situation  as  in  Figure  1-2,  except  that  the  restriction  index  for  the  column  R2.B2  is  missing.  In  this  case,  the 
join  is  performed  as  in  Case  3.  The  only  difference  is  that,  since  indexes  are  now  absent  for  the  restriction 
columns  of  R2,  the  restriction  predicate  for  R2  can  be  resolved  only  after  the  tuples  are  retrieved.  The  cost  of 
evaluating  the  query  becomes 

Cost  =  [IA(IA2,R ^  +  FA2  X  mR1  +  Sort(FA2  X  HR1  X  mR1)] 

+  (ISfl^.R^  +  b(mR2,pR2,Cf21  X  nR2)] 

=  Cost(R  (,Cf2 sort-merge)  +  Cost(R2,Cf12  join-index) 
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In  the  case  in  which  RrA2  is  missing  (rather  than  R2.B2),  it  will  change  the  first  part  of  the  cost  formula  we 
obtained  in  Case  3,  but  will  result  in  the  same  shorthand  form.  The  case  in  which  the  join  index  method  is 
used  on  Rt  and  the  sort-merge  method  on  R2,  as  in  Case  4,  is  treated  similarly. 

1 .5.4  Formalization 

In  all  the  cost  formulas  so  far,  the  coupling  factors  have  been  used  in  both  directions,  -  i.e.,  both  bracketed 
expressions  in  a  formula  were  of  the  form  Cost(Rk,Cf.k,type-of-join).  We  shall  call  the  form  of  these  formulas 
symmetric. 

Join  costs  can  be  written  in  this  form  only  when  the  coupling  factors  are  known  to  be  effective  for  the  join 
method  used  (as  when  the  join  index  method  was  used  in  the  previous  cases),  or  when  the  cost  can  be 
determined  regardless  of  the  coupling  factors  (as  when  the  sort-merge  method  is  used).  The  reason  is  that  the 
only  ambiguity  in  determining  the  cost  of  a  join  is  whether  or  not  the  coupling  factor  will  be  included  in  the 
calculation -since  all  other  information  needed  is  local  and  is  not  affected  by  interaction  with  other  relations. 
If  we  know  at  design  phase  that  coupling  factors  are  effective  or  that  the  cost  is  independent  of  the  coupling 
factor,  we  can  determine  at  design  phase  the  costs  of  various  possible  joins  on  each  relation  and,  using  only 
local  information  and  the  coupling  factors  without  ambiguity,  accordingly  determine  the  best  join  method 
and  its  cost.  There  are,  however,  some  cases  in  which  we  cannot  determine  whether  the  coupling  factors  are 
effective  at  design  phase.  These  will  be  introduced  in  Example  3. 

If  the  best  join  method  can  be  determined  with  only  the  local  information  (the  access  configuration  of  the 
relation  and  the  type  of  join  method  used)  and  coupling  factors,  without  any  regard  to  other  relations,  the 
clear  implication  is  that  we  can  design  an  optimal  access  configuration  of  a  relation  by  using  only  local 
information  and  the  coupling  factors,  independently  of  the  other  relations.  The  design  could  be  performed 
by  the  following  procedure: 

1.  Consider  each  possible  access  configuration  of  a  relation  in  turn 

2.  Find  the  best  join  method  and  its  cost  for  the  particular  configuration 

3.  Repeat  this  procedure  for  other  access  configurations 


-  21  - 


SEPARABILITY  AS  A  PHYSICAL  DATABASE  DESIGN  METHODOLOGY 


4.  Find  the  one  that  gives  the  minimum  join  cost 

The  only  nonlocal  information  used  here  is  furnished  by  coupling  factors.  Lumped  within  them  are  all  the 
interactions  from  other  relations.  B  havet  we  already  observed  that  the  coupling  factors  do  not  depend  on 
access  configurations  of  the  other  relations,  nor  do  they  depend  on  the  join  methods  chosen;  they  depend 
exclusively  on  the  properties  of  given  queries  and  the  data  characteristics  of  the  relation.  Furthermore,  these 
properties  can  be  determined  before  we  start  designing  any  access  configuration  in  the  database. 

We  conclude  here  that  we  can  design  the  access  configuration  of  the  entire  database  optimally  by  designing 
the  optimal  access  configurations  of  individual  relations  one  by  one,  regardless  of  the  remaining 
relations— when  all  the  information  needed  is  known  at  design  time.  The  local  optimum  configurations  will 
collectively  comprise  the  global  optimum  configuration. 

To  formalize  the  foregoing  observation,  we  need  the  following  definitions  and  theorems. 

Definition  5:  A  partial-join  cost  is  that  part  of  the  join  cost  that  represents  the  accessing  of  only  one 
relation,  as  well  as  the  auxiliary  access  structures  defined  for  that  relation.  □ 

In  the  examples  above,  each  expression  in  square  brackets  represents  a  partial-join  cost 

Definition  6:  A  partial-join  algorithm  is  a  conceptual  division  of  the  algorithm  of  a  join  method  whose 
processing  cost  is  a  partial-  join  cost.  □ 

Definition  7:  A  join  method  is  symmetric  under  certain  constraints  if,  under  these,  both  partial-join  costs 
can  be  determined  with  only  local  information  of  the  pertinent  relation  and  the  coupling  factor,  regardless  of 
the  partial-join  algorithm  used  and  the  access  configuration  defined  for  the  relation  on  the  other  side  of  the 
join.  □ 

Definition  8:  A  set  of  join  methods  is  separable  under  certain  constraints,  if  under  these  constraints 

•  Any  partial-join  algorithm  of  a  join  in  the  set  can  be  combined  with  any  partial-join  algorithm  of 
any  join  method  in  the  set,  and 
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•  Any  combination  of  partial-join  algorithms  of  the  join  methods  in  the  set  produces  a  symmetric 
join  method.  □ 

From  the  discussion  at  the  beginning  of  Section  1.5.4,  we  have  the  following  lemma. 

Lemma  1:  A  join  method  is  symmetric  if  and  only  if  its  cost  has  a  symmetric  form.  □ 

Theorem  2:  The  problem  of  designing  the  optimal  access  configuration  of  a  database  can  be  decomposed 
into  the  tasks  of  designing  the  optimal  access  configurations  of  individual  relations  independently  of  one 
another,  if  the  set  of  join  methods  used  by  the  optimizer  is  separable  with  respect  to  the  constraints  imposed 
upon  the  database  system. 

Proof:  Since  the  set  of  join  algorithms  used  is  separable,  we  can  choose  an  arbitrary  combination  of  partial- 
join  algorithms  within  the  set  Thus,  we  can  choose  any  partial-join  algorithm  to  be  used  for  one  relation 
without  regard  to  the  partial-join  algorithm  used  for  the  other  relation.  Furthermore,  since  a  join  method 
consisting  of  any  combination  of  partial-join  algorithms  is  symmetric,  the  partial-join  cost  of  a  partial-join 
algorithm  can  be  evaluated  independently  of  the  partial -join  algorithm  used  and  the  access  configuration 
defined  on  the  other  side  of  the  join.  As  a  result,  the  specific  access  methods  assigned  to  and  the  partial-join 
algorithm  used  for  one  relation  cannot  affect  any  design  parameters  for  the  other  relations.  It  is  therefore 
guaranteed  that  there  will  be  no  interference  among  the  designs  of  individual  relations.  Q.E.D. 

Theorem  2  is  a  generalization  of  the  observation  made  from  Example  1,  except  that  it  now  includes  the 
coupling  effects  between  relations. 

Theorem  3:  The  set  of  join  algorithms  consisting  of  the  join  index  method  and  the  sort-merge  method  is 
separable  under  the  constraint  that  every  column  in  every  relation  in  the  database  must  have  an  index  defined 
for  it 

Proof:  Part  1  of  Definition  8  is  obvious  from  previous  examples  and  cases.  When  the  join  index  method  is 
used  for  both  relations,  all  predicates  are  index-proccssible  since  every  column  has  an  index.  Hence,  all 
predicates  are  resolved  with  TIDs  before  the  relations  themselves  are  accessed;  coupling  factors  are  effective 
in  both  directions;  and  the  cost  formula  has  symmetric  forms.  When  the  sort-merge  method  is  used  for  one 
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relation  and  the  join  index  method  for  the  other,  then,  by  the  same  reasoning  as  in  Case  3,  the  cost  formula 
has  symmetric  forms.  If  only  the  sort-merge  method  is  used,  the  cost  formula  is  always  symmetric.  Therefore, 
from  Lemma  1.  the  theorem  holds.  Q.E.D. 

Only  symmetric  joins  have  been  used  in  the  example  and  cases  presented  so  far.  There  aic,  however, 
instances  of  nonsymmetric  joins. 

Example  3:  Let  us  assume  that  the  join  index  method  is  used  for  both  R1  and  R2,  in  the  same  situation  as  in 

Figure  1-2,  but  that  now  restriction  indexes  for  both  Rx  and  R2  are  missing.  In  this  situation,  since  ihere  are 

no  restriction  indexes,  there  is  no  way  of  resolving  the  restriction  predicate  without  accessing  the  tuples 

themselves.  Therefore,  if  we  access  relation  R2  first,  the  access  cost  would  be 

Costl  -  11S(IA1,R1)  +  il  X  nR1]  +  (IS(IB1,R2)  +  b(mR2,pR2,Cf12  X  nR2)] 

=  Cos^RpJjjoin-index)  +  Cost(R2,Cf12join-index) 

On  the  other  hand,  if  we  access  relation  R2  first,  the  access  cost  would  then  be 

Cost2  =  [IS(IArR1)  +  Cf21  X  nR1l  +  [lS(lfil,R2)  +  b(mR2,pR2J2  X  nR2)] 

=  Cost(RrCf21join-index)  +  Cost(R2,J2join-index) 

Therefore,  we  have  two  expressions  each  for  the  partial-join  cost  of  each  relation  and  we  cannot  determine 
at  the  design  stage  which  of  them  is  cheaper.  Hence,  this  join  method  is  not  symmetric.  The  coupling  factor 
is  ineffective  in  one  direction  in  each  formula,  since  the  join  selectivity  is  used  in  its  place.  The  cost  formula  is 
now  also  asymmetric  relative  to  the  coupling  factors.  □ 

We  can  still  determine  which  of  the  two  expressions  is  cheaper  at  query-processing  time,  but  we  do  not 
have  this  knowledge  when  the  physical  database  is  being  designed.  If  we  want  to  ascertain  the  cheaper 
expression  at  design  time,  we  have  to  analyze  simultaneously  the  relation  on  the  other  side  of  the  join -but 
this  violates  the  definition  of  symmetry.  The  design  of  access  configuration  for  one  relation  is  no  longer 
independent  of  the  other  relations.  The  theory  presented  in  this  paper  depends  entirely  on  the  property  of 
separability,  which  in  turn  depends  on  that  of  symmetry.  The  situation  depicted  in  Example  3  is  an  apparent 
exception  to  our  theory.  However,  in  our  discussion  of  the  index  selection  problem  in  Section  1.6,  the 
justification  on  the  validity  of  our  approach  will  be  amply  reinforced. 
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Theorem  4:  The  set  of  join'methods  consisting  of  the  join  index  method  and  the  sort-merge  method  is 
separable  under  the  constraint  that,  whenever  the  join  index  method  is  used  for  both  relations,  at  least  one 
relation  must  have  indexes  for  all  restriction  columns. 

Proof:  When  both  relations  have  indexes  on  all  restriction  columns,  this  theorem  reduces  to  Theorem  3.  As 
before  when  the  sort-merge  method  is  used  for  both  relations,  the  cost  formulas  are  always  symmetric.  When 
the  join  index  method  is  used  for  one  relation  and  the  sort-merge  method  for  the  other,  then,  by  a  reasoning 
similar  to  Case  3- A,  we  obtain  symmetric  cost  formulas.  If  only  the  join  index  method  is  used  and  one  of  the 
relations,  say  Rp  has  incomplete  restriction  indexes,  the  join  is  performed  as  in  Case  1-A  except  that  the 
restriction  on  Rj  is  now  partially  resolved  by  using  TIDs  before  accessing  the  tuples  in  Rr  We  thus  get 
symmetric  cost  formulas.  By  Lemma  1,  we  prove  this  theorem.  Q.E.D. 

1.5.5  Update  Cost 

We  assume  here  that  the  updates  are  performed  only  on  individual  relations,  although  the  qualification 
part  (WHERE  clause)  may  involve  more  than  one  relation.  Thus,  updates  are  not  performed  on  the  join  of 
two  or  more  relations.  Hence,  if  we  segregate  the  qualification  part  (this  will  be  treated  as  a  query),  the 
remaining  part  of  an  update  transaction  becomes  separable  (the  update  operation  on  one  relation  does  not 
depend  on  the  access  configuration  of  the  other  relations).  Note  that  we  have  assumed  throughout  that  a 
block  containing  tuples  of  a  relation  contains  only  the  tuples  of  that  relation. 

1.6  Design  Algorithm 

In  this  section,  an  algorithm  for  the  design  of  optimal  access  configuration  of  the  database  will  be 
presented. 

1.6.1  Design  Step  1 

Based  mainly  on  the  result  of  Theorems  2  and  3,  the  first  step  of  our  algorithm  is  as  follows: 

Inputs: 

•  Usage  information:  A  set  of  various  types  of  queries  and  update  transactions  with  their  respective 
frequencies. 
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•  Data  characteristics  (for  every  relation  in  the  database):  Size,  blocking  factor,  selcctivities  of  all 
columns,  relationships  with  other  relations  with  respect  to  join  paths,  join  selectivity  with  respect 
to  join  paths. 


Outputs: 

•  Optimal  position  of  the  clustering  column  for  each  relation. 

•  Optimal  combination  of  partial-joins  for  each  type  of  two-variable  query. 


Condition  Assumed: 

•  Every  column  of  each  relation  in  the  database  has  an  index  defined  for  it  Some  of  these  indexes 
will  be  dropped  in  the  subsequent  index  selection  step. 


Algorithm  1: 

1.  Segregate  the  usage  information  in  such  a  way  that,  if  there  is  a  subquery  involving  more  than  one 
relation  in  the  qualification  part  of  an  update  transaction,  it  is  separated  and  its  frequency  is 
included  with  that  of  the  same  type  of  query.  Thereupon,  all  the  remaining  parts  of  the  update 
transactions  will  refer  to  only  one  relation. 

2.  Calculate  the  coupling  factors  with  respect  to  individual  two-variable  queries  for  every  relation  in 
the  database  using  the  given  data  characteristics. 

3.  Pick  one  relation  and  determine  the  optimal  position  of  the  clustering  column  as  follows: 

a.  Assign  the  clustering  property  to  one  column  of  the  relation. 

b.  Given  that  position  of  the  clustering  column,  identify  the  best  partial-join  algorithm  and 
calculate  its  parual-join  cost  for  every  two-variable  query  that  refers  to  this  relation,  using 
the  given  data  characteristics  and  the  coupling  factors. 

c.  Utilizing  the  usage  information  and  the  result  of  Step  b,  calculate  the  total  cost  associated 
with  this  relation.  This  is  done  by  summing  up  all  the  partial-join  costs  identified  in  Step 
b— multiplied  by  their  respective  frequencies -and  all  costs  incurred  by  one-variable 
queries  and  update  transactions  acting  upon  this  relation. 

d.  Shift  the  clustering  property  to  another  column  of  the  relation  and  repeat  Steps  b  and  c. 

e.  Repeat  Step  d  until  all  the  columns  of  the  relation  have  been  considered.  (The  case  in 
which  there  is  no  clustering  column  is  also  considered.  Then  determine  the  one  that  gives 
the  minimal  cost  as  the  clustering  column  (or  none). 

4.  Step  3  is  repeated  for  every  relation  in  the  database.  The  aggregate  of  results  for  all  relations 
comprises  the  global  optimum. 
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A  join  path  can  often  have  a  multiple  column  as  the  join  column  on  cither  relation.  In  such  cases,  we 
consider  the  multiple  join  column  as  a  single  effective  column,  independent  of  its  component  columns. 
Therefore,  according  to  the  condition  in  the  above  algorithm,  this  effective  column  is  considered  to  have  a 
multiple-column  index  defined  for  it  (we  do  not  consider  here  additional  problems  involved  in  the  multiple- 
column  indexes). 

Although  in  some  cases  improvement  can  be  obtained  by  an  adjustment  in  ordering  among  the  effective 
column’s  component  columns  and  by  the  deletion  of  overlapping  indexes,  this  is  not  being  considered  here. 
It  will  be  noted  that,  under  the  assumptions  given,  the  Design  Step  I  algorithm  yields  a  mathematically  true 
optimum. 

1 .6.2  Design  Step  2:  Index  Selection 

In  the  algorithm  for  Design  Step  1,  we  imposed  the  restriction  that  every  column  of  the  relations  in  the 
database  must  have  an  index  defined  for  it  However,  not  every  index  is  beneficial.  Some  indexes  can 
increase  the  total  access  cost  because  of  their  own  access  and  update  costs. 

The  index  selection  problem  has  been  extensively  studied  by  [KIN  74]  [SCH  75]  [HAM  76].  It  concerns  the 
method  of  selecting  a  set  of  indexes  that  will  minimize  the  processing  cost  in  a  single-relation  environment 
Here  we  are  using  a  slightly  modified  version  of  the  approach  introduced  in  [HAM  76].  The  main 
modification  involves  translating  the  frequency  of  a  partial-join  into  an  equivalent  frequency  of  a  one-variable 
query  (single-relation  restriction).  The  following  example  should  clarify  the  procedure: 

Example  4:  Let  us  consider  the  same  query  and  situation  as  in  Figure  1-2,  except  that  now  both  indexes  for 
R2.B1  and  R2.B2  arc  nonc'l,stcDnB*  When  we  use  the  join  index  method  for  both  relations,  the  partial-  join 
cost  of  the  partial-join  for  relation  R2  becomes 

Cost  =  [IA(IB2,R2)  +  IS(Im,R2)  +  Cf12  X  X  nR2]  (1.9) 

However,  if  we  refine  our  assumption  so  that  tuples  having  the  same  join  column  value  are  now  accessed  in 

TID  order  (we  have  ignored  this  fact  so  far  for  the  sake  of  simplicity),  the  cost  formula  becomes 

Cost  =  [IA(IB2,R2)  4-  IS(IB1,R2)  +  (Cf12/FB1)  X  b(mR2,  pR2,  Ffil  X  FR2  X  nR2)]  (1.10) 
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Here  (Cf12/Ffil)  is  the  number  of  distinct  join  column  values  selected  by  the  coupling  factor  and  the  function 
b  represents  the  cost  of  accessing  the  data  tuples  that  have  specific  join  column  values  and  satisfy  the 
restriction  predicate.  We  used  the  b  function  because  those  tuples  that  have  the  same  join  column  values  are 
accessed  in  TID  order.  If  tuples  with  the  same  join  column  values  were  accessed  randomly,  we  would  obtain 
Equation  (1.9).  Although  Equation  (1.10)  is  only  a  small  refinement  over  Equation  (1.9),  it  makes  it  easy  to 
observe  that  the  data  tuple  access  cost  of  a  nartial-join  can  be  regarded  as  the  joint  restriction  cost  of  the  join 
index  and  the  restriction  index,  multiplied  by  the  factor  (Cf12/FB1).  We  call  this  factor  the  equivalent 
restriction  frequency  of  a  partial-join.  Let  us  note  that  the  function  b  in  the  above  formula  yields  exactly  the 
same  cost  as  would  the  joint  restriction  of  two  indexes. 

More  importantly,  it  can  be  shown  that  the  gain  in  access  cost  by  having  the  restriction  index  in  a  partial- 
join -assuming  the  join  index  is  always  present— is  equal  to  the  gain  in  access  cost  that  the  same  restriction 
index  would  yield  in  the  joint  restriction,  multiplied  by  the  equivalent  restriction  frequency.  The  same 
observation  holds  with  certain  limitations  when  one  of  the  indexes  is  clustering.  A  more  detailed  treatment 
can  be  found  in  Appendix  B.  □ 

Definition  9:  The  equivalent  restriction  frequency  of  a  partial-  join  is  defined  as  the  ratio  of  the  gain  in 
access  cost  by  having  the  restriction  indexes  in  a  partial-join  to  the  gain  in  access  cost  that  the  same  restriction 
indexes  would  yield  in  the  joint  restriction  with  the  join  index  if  the  join  index  is  used  in  the  partial-join  (i.e,  if 
the  join  index  method  is  used),  or  in  the  restriction  of  the  restriction  indexes  alone  if  the  join  index  is  not  used 
(i.e.,  if  the  sort-merge  method  is  used).  □ 

According  to  this  definition,  the  equivalent  restriction  frequency  of  a  partial-join  using  the  sort-merge 
method  is  1,  if  the  restriction  indexes  are  used  to  access  the  relation  initially  before  sorting,  and  0  otherwise. 

Since  the  preceding  discussion  is  not  concerned  with  the  index-accessing  cost,  wc  use  the  equivalent 
restriction  frequency  only  to  estimate  the  ranking  of  indexes  in  importance,  as  will  be  explained  later.  We 
shall  utilize  partial-join  cost  formulas  in  our  actual  cost  calculation. 
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Following  is  the  algorithm  for  Design  Step  2.  This  algorithm  is  mainly  based  on  the  above  discussion,  and 
i  Theorems  2  and  4. 


Inputs: 

•  Outputs  from  Design  Step  1:  optimal  position  of  the  clustering  column  for  each  relation  and 
optimal  combination  of  partial -joins  for  each  type  of  two-  variable  query. 

•  Set  of  types  of  one-variable  queries  and  update  transactions  of  interest  with  their  respective 
frequencies.  Here  each  type  of  one-variable  query  represents  any  Boolean  combination  of  simple 
predicates.  A  simple  predicate  is  one  that  refers  to  only  one  column  of  the  relation. 

•  Data  characteristics  similar  to  the  ones  used  in  Design  Step  1,  but  only  those  parameters  that 
pertain  to  single  relations  are  relevant. 


Outputs: 

•  Set  of  indexes  of  each  relation  that  gives  the  minimum  processing  time. 


Algorithm  2: 

1.  Select  one  relation 

2.  From  the  information  outputted  in  Design  Step  I,  calculate  the  equivalent  restriction  frequency  of 
each  partial-join  involving  this  relation. 

3.  From  the  usage  information  for  one-variable  queries  and  the  equivalent  restriction  frequencies 
calculated  above,  compute  the  total  frequency  f  of  references  to  each  column. 

4.  Rank  the  importance  of  the  columns,  using  f  X  m  X  (1  -  F),  where  m  is  the  total  number  of 
blocks  of  the  relation  and  F  is  the  selectivity  of  each  column.  The  above  formula  represents  an 
upper  bound  on  the  number  of  block  accesses  saved  by  the  restriction  index,  in  the  sense  that  it 
represents  the  number  of  block  accesses  saved  if  there  is  no  other  index  and  all  the  selected  tuples 
are  clustered  [HAM  76). 

5.  If  a  join  index  has  ever  been  used  in  Design  Step  1  -  that  is,  if  at  least  one  partial-join  uses  the  join 
index  method  — then  assign  an  index  io  that  column  by  default,  litis  is  a  heuristic  wc  use  to  avoid 
strong  interference  between  Design  Step  1  and  Design  Step  2  (Wc  assume  that  column  domains 
are  rigorously  defined,  and  that  joins  are  limited  to  semantically  appropriate  columns  [WIE  79]). 
If  the  join  index  used  in  Design  Step  1  were  dropped  in  Design  Step  2,  wc  would  have  to  switch 
all  the  partial-joins  that  used  this  now  nonexistent  join  index  to  the  sort-merge  method.  The 
result  would  be  to  distort  the  entire  cost  calculation  that  was  performed  in  Design  Step  1. 

6.  Select  indexes  incrcmcntly  one  by  one,  ordered  by  rank.  Include  only  those  indexes  that  reduce 
the  total  cost  If,  during  the  cost  calculation,  a  query  type  represents  a  partial-join  rather  than  a 
one-variable  query,  the  partial-join  cost  is  used  instead  of  the  joint  restriction  cost. 
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1.6.3  Separability  in  Design  Step  2 

The  implicit  meaning  of  the  index  selection  is  that  those  indexes  that  do  not  compensate  for  their  own 
maintenance  and  access  cost  should  be  dropped.  In  Design  Step  2  we  again  considered  relations  singly  and 
independently  of  one  another.  This  was  based  on  the  separability  theory  of  Theorem  2.  i.e.,  that  the  access 
structures  assigned  to  one  relation  do  not  affect  cost  calculations  for  other  relations.  However,  since,  in 
contrast  to  Design  Step  1,  we  arc  eliminating  some  indexes,  we  can  encounter  situations  that  were  excluded  as 
exceptions  in  Example  3  and  Theorem  4.  In  these  situations,  calculation  of  cost  is  no  longer  separable. 
Nevertheless,  it  turns  out  the  calculativc  error  caused  by  the  assumption  of  separability,  even  in  thest 
exceptional  situations,  is  not  significant. 

If  we  look  at  Example  3  again,  the  actual  cost  at  query -processing  time  will  be 
Cost  =  min(Costl,  Cost2) 

=  min[{Cost(R1,J1  join-index)  4-  Cost(R2,Cf12join-index)}, 

{Cost(RrCf21, join-index)  4-  Cost(R2,J2join-index)}] 

But,  because  we  assumed  symmetry,  the  sum  of  the  costs  we  used  implicitly  in  Design  Step  2  is 
Cost’  =  Cost(RrCf12 join-index)  4-  Cost(R2,Cf21  join-index) 

Thus,  the  total  error  in  cost  estimation  will  be 

Error  =  g  X  (Cost  -  Cost*)  (1.11) 

=  g  X  mintlCostCR^J^oin-index)  -  Cost(RrCf21  join-index)} 

(Cost(R2,J2join-index)  -  Cost(R2,Cf12join-index)}] 

,  where  g  is  the  frequency  of  this  join. 

Remember,  however,  that  the  restriction  indexes  for  both  relations  had  been  dropped  because  their 
benefits  did  not  compensate  for  their  update  and  access  cost.  Hence,  it  must  be  cither  that  the  frequency  of 
access  to  the  column  is  not  significant,  or  that  the  effect  of  selectivity  is  small.  Therefore,  cither  the  frequency 
of  the  join  we  arc  concerned  with  is  insignificant  or  the  coupling  factor  approaches  the  join 
selectivity -making  the  error  insignificant  (see  Equation  (1.11)).  Following  this  argument,  we  claim  that 
separability  can  be  applied  to  all  the  cases  of  concern  without  causing  any  significant  error.  Similar  situations 
arise  when,  on  both  relations,  only  some  of  the  restriction  columns  specified  in  a  query  have  indexes  assigned, 
while  others  do  not.  A  similar  argument  holds  for  such  cases. 
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The  costs  of  some  partial-joins  that  use  the  join  index  method  may  be  changed  as  a  result  of  the  removal  of 
some  indexes  in  Design  Step  2.  This  could  make  the  sort-merge  method  more  feasible.  However,  according 
to  an  argument  analogous  to  the  one  above,  we  claim  that  the  total  of  errors  incurred  in  such  situations  could 
not  be  significant;  otherwise  the  index  would  not  have  been  dropped. 

1 .7  Extensions  and  Further  Study 

An  extension  of  nonseparable  joins,  for  instance,  the  inner/outer-loop  join  method  and  the  multiple-pass 
method  described  in  Section  1.3,  could  be  made  by  means  of  the  following  heuristic  method.  After  Design 
Step  1,  each  type  of  two-variable  query  is  considered  in  turn  and  its  join  cost,  as  determined  in  Design  Step  1, 
is  compared  with  possible  nonseparable  joins.  If  a  nonseparable  join  is  cheaper,  that  query  type  should  be 
marked  to  note  that  this  nonseparable  join  must  be  used.  For  a  possible  shift  of  the  clustering  column,  after 
completion  of  this  step,  Design  Step  1  should  be  repeated- with  the  join  method  for  a  marked  query  type 
fixed  to  be  the  nonseparable  join  method  assigned  previously.  This  whole  procedure  (Design  Step  1  and  the 
refinement  step  with  nonseparable  join  methods)  is  repeated  until  the  refinement  becomes  insignifanL 

The  link  structure  [BLA  76]  can  be  considered  next.  For  every  join  path,  the  total  cost  of  all  queries  using 
this  join  path  is  compared  with  the  cost  based  on  a  hypothetical  link.  If  the  latter  is  less,  a  link  is  assigned  to 
that  join  path.  If  the  join  column  on  the  N-side  relation  of  the  1-to-N  relationship  is  a  clustering  column,  the 
link  is  endowed  with  the  clustering  property -otherwise  not. 

The  most  attractive  prospects  for  the  inner/outer-loop  join  methods  are  those  queries  that  use  the  sort- 
merge  method  for  the  relation  on  the  1-side  of  the  1-to-N  relationship,  but  use  the  join  index  method  for  the 
other  side.  Use  of  the  inner/outer  loop  join  method  in  these  cases  has  the  advantage  of  saving  sorting  time  on 
one  relation  and  indcx-scarching  time  on  the  other  (if  it  has  a  strong  coupling  factor).  On  the  other  hand,  join 
paths  that  support  many  queries  using  the  inner/outer-loop  join  method  would  be  the  most  promising 
prospects  for  the  link  structure.  Index  selection  could  be  done  at  the  conclusion  of  these  steps.  By  reinforcing 
the  foregoing  approach  with  improved  arguments  and  effective  heuristics,  we  look  forward  to  extending  our 
basic  theory  to  queries  of  more  titan  two  variables. 
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Finally,  although  we  have  developed  our  theory  in  terms  of  the  relational  system,  it  should  be  pointed  out 
that  the  basic  concept  of  separability  is  applicable  to  network  database  systems  as  well  (Theorem  2  holds  for 
any  system,  while  Theorems  3  and  4  are  relevant  only  for  relational  systems). 

1.8  Conclusion 

It  has  been  observed  and  proved  that,  with  a  separable  set  of  join  methods,  the  problem  of  designing  the 
optimal  physical  database  can  be  reduced  to  one  of  designing  optimal  individual  relations.  This  <  an  be  done 
independently  of  one  another  by  using  the  coupling  factors  that  represent  all  interactions  among  the  relations. 
This  substantially  diminishes  the  complexity  of  the  problem  by  partitioning  it  into  disjoint  subproblems.  The 
task  is  made  even  more  manageable  by  dividing  the  procedure  into  two  steps -one  for  determining  the 
optimal  positions  of  clustering  columns,  the  other  for  index  selection.  A  proper  interface  between  the  two 
steps  was  introduced. 

Design  Step  1  results  in  a  true  mathematical  optimum.  Although,  because  of  the  heuristics  used  in  Design 
Step  2  and  for  the  interface  between  the  two  steps,  the  overall  design  does  not  provide  a  true  optimum,  it  was 
argued  that  the  deviation  would  be  insignificant. 

The  key  objective  of  this  paper  is  to  propose  a  formal  approach  to  the  design  of  physical  databases  that 
simplifies  the  problem  considerably  and,  at  the  same  time,  provides  better  insight  into  underlying 
mechanisms.  We  believe  that  this  novel  approach  can  enable  substantial  progress  to  be  made  in  the  optimal 
design  of  multifile  physical  databases. 
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2.  Estimating  Block  Accesses  in  Database 

Organizations  -  A  Closed  Noniterative  Formula 

2.1  Introduction 

In  evaluating  the  access  cost  of  a  query  for  a  database  organization  in  which  records  are  grouped  into 
blocks  in  secondary  storage  [WIE  77],  one  must  often  estimate  the  number  of  block  accesses  required  to 
retrieve  the  records  selected  by  the  query.  Various  formulas  have  been  proposed  for  this  purpose  [CAR  75] 
[ROT  74]  [SEV  72]  [SIL  76]  [WAT  72]  [WAT  75]  [WAT  76]  [YAO  77]  [YUE  75].  In  particular,  Yao  [YAO  77] 
presented  the  following  theorem:7 

Theorem  1:  [Yao]  Let  n  records  be  grouped  into  m  blocks  (1  <m<n),  each  containing  p  =  n/m  records.  If 
k  records  are  randomly  selected  from  the  n  records,  the  expected  number  of  blocks  hit  (blocks  with  at  least 


one  record  selected)  is  given  by 

b(m,p,k)=m[l  —  (J  V(J)1  (2.1) 

=  m  [1  -  ((n  -  p)!(n  -  k)!)/((n  - p  -  k)!n!)]  (2.2) 

=  m  [1  -  n*=1(n-p-i+l)/(n-i+l)]  (2.3) 

when  k<n-p,  and 

b(m,p,k) = m  when  k  >  n  -  p.  (2.4) 

Earlier  Cardenas  [CAR  75]  suggested  the  formula 

bc(m,p,k)  =  m  [1  -  (1  -  l/m)k],  (2.5) 


assuming  that  there  are  n  records  divided  into  m  blocks  and  that  the  k  records  are  randomly  selected  from  the 
n  records.  It  is  interesting  to  note  that  Eq.  (2.5)  is  independent  of  the  blocking  factor  p. 

Yao  [YAO  77]  showed  that  Eq.  (2.5)  is  based  on  the  assumption  that  records  arc  selected  with  replacement , 
i.c.,  a  record  can  be  selected  more  than  once.  But  this  assumption  docs  not  hold  in  practice,  since  records 
selected  by  a  query  simultaneously  must  be  distinct  from  one  another.  Yao  eliminated  this  assumption  and 
proved  Theorem  1  under  the  assumption  that  records  are  actually  selected  without  replacement ,  i.e.,  a  record 
cannot  be  selected  more  than  once  at  one  time. 

7The  notation  and  some  of  the  conditions  have  been  slightly  modified. 
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Theorem  1  gives  the  exact  formula  under  the  given  assumptions.  However,  we  notice  that  Eq.  (2.3)  has  an 

iterative  form,  which  will  take  excessive  time  to  evaluate  if  k  becomes  large.  Another  way  of  evaluating  Yao’s 

formula  is  by  using  the  Gamma  function  (in  practice  a  Log  Gamma  (LGAM)[1BM  70]  function  should  be 

used,  since  the  Gamma  function  grows  very  steeply).  By  modifying  Eq.  (2.2)  slightly,  we  obtain 

b(m,p,k)  =  m  [1  -  cxp(LGAM(n-p)  +  LGAM(n^k)  -LGAM(n-p-k)  -LGAM(n))].  (2.6) 

Evaluation  of  this  formula  poses  a  problem  in  practice,  especially  when  k  is  small.  Since,  in  the  evaluation  of 

the  argument  of  the  exponential  function,  we  are  subtracting  big  numbers  from  equally  big  numbers  to  get  a 

very  small  number,  the  roundoff  error  of  the  computation  can  become  intolerable.  For  example,  when  Eq. 

(2.6)  is  calculated  by  using  single-precision  variables  on  a  36-bit  machine  having  the  resolution  of  2“ 27 

(c^10_8)[DEC  78],  it  has  a  46%  error  at  p  =  10,  m=  1000,  n  =  10000,  and  k  =  2.  The  roundoff  error  i$  310% 

when  p=10,  m-3162,  n  =  31620,  and  k  =  3.  But  these  values  of  parameters  are  well  within  the  range  of 

relevant  databases. 

We  propose  below  a  closed  noniterative  formula  that  approximates  Yao's  exact  formula  with  reasonable 
accuracy,  as  well  as  reducing  considerably  the  computation  error  caused  by  limited  precision. 

2.2  A  Noniterative  Formula 

In  this  section,  we  introduce  the  following  formula  and  discuss  how  it  was  obtained.  Errors  of  this  formula 

will  be  discussed  in  Section  2.3.  We  assume  throughout  that  m  and  k  have  only  integer  values. 

bw)(m,p,k)/m  =  [1  -  (l-l/m)k]  (2.7) 

+  [l/m2p  X  k(k-l)/2  X  (l-l/m)k-1] 

+  [1.5/mV  X  k(k-  l)(2k-  I)/6  X  (l-l/m)k-1J 
when  k<n-~p,  and 

bWi(m,p,k)/m  =  1  whenk>n-p  (2.8) 

Let  us  sec  how  Hq.  (2.7)  has  been  derived.  When  k  >  n-p,  we  always  have  bwl(m,p,k)/m  =  1  from  Eq.  (2.4). 
If  we  use  n  =  mp,  Hq.  (2.3)  can  be  transformed  to  an  equivalent  form 

b(m,p,k)/m  =  1  -  nj^O  —  l/m(l  -  i/mp))  (2.9) 

If  we  perform  a  series  expansion  on  l/m(l  -  i/mp)  and  take  only  the  first  three  terms,  we  obtain 

b(m,p,k)/m  ~  1  -  njtro1((l-l/m)-i/m2p-i2/m3p2) 

If  wc  expand  the  multiplication  and  keep  the  first  three  terms,  we  get 
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b(m,p,k)/m  ~  [1  -  (1-  l/m)k]  (2.10) 

+  [l/m2p  X  k(k-l)/2  X  (l-l/m)k-1] 

+  [l/m3p2  X  k(k-  lX2k- 1)/6  X  (l-l/m)k_1]. 

Eq.  (2.10)  is  only  an  approximation  of  Eq.  (2.9),  since  we  took  only  a  few  terms  from  the  expansions.  Two 
factors  were  added  to  Eq.  (2.10)  to  derive  Eq.  (2.7).  The  factor  1.5  has  been  introduced  empirically  to 
compensate  for  the  errors  at  small  values  of  p,  i.e.,  p~  1 .  It  was  chosen  especially  to  reduce  the  error  to  zero 
when  p  =  1,  k  =  [n  — pj,  as  n  goes  to  infinity  (n— »oo),  in  which  case  Eq.  (2.10)  has  the  most  significant  error. 
The  factor  1/p2  has  been  introduced  empirically  to  reduce  the  effect  of  the  third  term  for  higher  values  of  p, 
for  adding  the  third  term  at  these  values  of  p  increases  the  error  (although  it  reduces  the  error  at  lower  values 
of  p).  We  shall  show  later  that  the  approximation  formula  derived  here  constitutes  a  practically  negligible 
deviation  from  the  exact  formula. 

2.3  Error  Analysis 

We  note  that  the  first  term  of  Eq.  (2.7)  is  identical  to  Cardenas’  formula,  Eq.  (2.5).  The  second  term 
compensates  for  the  major  error  of  Eq.  (2.5),  while  the  third  term  provides  a  finer  adjustment  to  further 
reduce  the  error.  The  third  term  has  been  empirically  modified  to  get  a  better  approximation. 

Derived  in  ITieorcm  2  and  plotted  in  Figure  2-1  for  various  values  of  p  and  k= k/n  is  a  formula  that  gives 
the  limiting  values  of  the  error  ERR(m,p,k)  =  (b(m,p,k)  -  bwl(m,p,k))/b(m,p,k)  as  the  total  number  of  blocks 
m  (and,  accordingly,  the  total  number  of  records  n)  goes  to  infinity. 

Theorem  2: 

m^oo  ERR(m,p,k)  =  1  - (1  - e ~ PK(1 — Pk2/2 - «:3/2p))/(l - (1  - #c)p),  (2.11) 

where  HRR(m,p,k)  =  (b(m,p.k)— bwl(m,p,k))/b(m,p,k),  and  p  and  k  have  fixed  values. 

Proof:  To  derive  this  formula,  we  need  the  following  form  of  Yao’s  formula,  which  has  the  iteration  on  the 
blocking  factor  p  rather  than  on  the  number  of  selected  records  k, 

b(m,p,k)  =  1  -  nP=  j((n  -  k  -  i  +  l)/(n  -  i  + 1))  (2.12) 

This  formula  is  easily  derivable  from  Eq.  (2.2).  If  we  subtract  Eq.  (2.7)  from  Eq.  (2.12)  and  divide  the  result 
by  Eq.  (2.12),  we  can  obtain  Eq.  (2.11)  by  taking  the  limit  as  m-+oo(accordingly  n-*oo)  and  by  using  the 
identity  - 1  /rn)m  =  e  ' 1 .  Q.E.D. 
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Eq.  (2.12)  is  also  a  convenient  formula  for  evaluating  the  exact  value  when  we  have  integer  blocking 
factors.  In  fact,  all  computed  values  for  integer  blocking  factors  that  we  shall  employ  later  in  this  section  were 
produced  by  using  Eq.  (2.12).  The  limiting  values,  as  the  blocking  factor  p  goes  to  infinity  with  m  and  k  fixed, 
are  proved  to  be  zero  in  the  following  theorem. 

Theorem  3:  pr^ooHRR(m,p,k)=0,  where  m  and  k  =  k/n  have  fixed  values.  Here  0<k<(m-  l)p,  and  k  is 
an  integer. 

Proof:  If  *c  =  0,  both  b(m,p,k)/m  and  bwi(m,p,k)/m  simply  become  1,  so  ERR(m,p,k)  must  be  zero.  If 
k  >  0,  we  know  that  since  at  *east  one  block  must  be  hit.  Therefore,  the  denominator  of 

ERR(m,p,k)  is  always  at  least  1  and  cannot  be  0.  To  study  the  behavior  of  the  numerator,  let  us  look  at  Eq. 
(2.12).  In  Eq.  (2.12),  (n-k-i  + l)/(n-i  +  l)<(n-k)/n  =  l-K<l.  Therefore, 

p|^>,00n[)=i((n-1c-i+  1)/(n-i  +  1))<p|^00(1  — k)p  =  0.  Thus,  J'!!»00b(m,p,k)=  1.  But  it  is  clear  from  Eq. 
(2.7)  that  J'^0Obwl(m,p,k)=l  also,  since  J'H>00(l-l/m)*mp-1=^00(l-l/m)'cmp=  ^ooe-''p=0,  and 
an  exponential  order  can  suppress  any  polynomial  order  of  p.  Hence,  lpHvOOERR(m,p,k)=0.  Q.E.D. 

The  errors  that  occur  when  both  n  and  p  are  finite  were  investigated  by  performing  an  exhaustive  computer 
calculation.  These  analyses  show  that  Eq.  (2.7)  yields  at  most  3.7%(-3.7%  if  the  sign  is  considered)  of  deviation 
from  the  exact  formula,  Eq.  (2.3),  over  the  entire  range  of  p>l,  m>l,  0<k<n-p,  where  m  and  k  are 
integers.  This  maximum  error  occurs  at  p  =  l+  V  2,  k  =  [n  —  pj  as  m-^oo  (This  can  be  observed  in  Figure  2- 
1.  In  fact  the  maximum  error  and  the  value  of  p  at  which  this  error  occurs  can  also  be  derived  from  Eq.  (2.11), 
once  we  know  that  this  occurs  at  k  =  1,  as  m— >00.)  The  maximum  positive  error  (2.5%)  occurs  at  p  =  1.5, 
k  =  3,  and  m  =  3.  The  maximum  positive  error  when  m-*oO  is  2.1%  at  p  =  1.7  and  k=0.65n. 

The  dependence  of  the  error  on  the  values  of  n/p  =  m  is  shown  in  Figure  2-2,  where  k  is  set  to  be  equal  to 
[n-pj  (note  that  the  maximum  error  occurred  at  this  k  value).  At  low  values  of  m  and  p  there  is  a  short 
range  within  which  errors  are  changing  by  a  large  amounts,  since  at  these  values  of  m  and  p, 
k  =  [n  -  pj  =  [(m  -  l)pj  is  in  the  range  where  high  positive  errors  occur,  as  we  sec  in  Figure  2-1  (sec  the  value 
when  p  =  2,  m  =  3,  n  =  6,  and  k  =  4,  for  example).  Yh c  dependence  of  the  error  on  m  is  otherwise  very  flat,  as 
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in  Figure  2-3,  which  shows  the  values  when  k  =  0.65  with  corresponding  k  values  rounded  to  the  nearest 
integers.  In  Figure  2-3  the  values  at  m  =  1  and  m  =  2  are  0  from  Eq.  (2.4)  and  Eq.  (2.8),  since  at  these  points 
k=0.65n  >  n-p. 

The  values  of  variables  we  used  in  the  exhaustive  computer  calculation  are  as  follows,  with  the  constraint 
that  mp<  107(106  for  noninteger  blocking  factors): 

•  m:  1, 2,  3, 10,  32, 100,  316, 1000,  3162, 10000,  31623, 100000, 316228, 1000000 

•  p:  1, 2,  3, 4, 5, 10,  32,  100, 316, 1000,  3162;  1.1, 1.2 . 1.9;  2.1,  2.2 . 2.9 

•  k/n:  0.0, 0.02, 0.05, 0.1, 0.15,  0.2 .  1.0 

•  k:  [n-pj,  1, 2,  3, 4, 5,  6, 7,  8, 9, 10,  32, 100 

2.4  Computational  Error  due  to  Limited  Precision 

The  major  computational  error  is  due  to  the  evaluation  of  (1  —  1/m)  in  Eq.  (2.7).  For  example,  if  m=  106, 
we  need  better  resolution  than  10  “6.  However,  it  is  shown  in[WHA  81]  that  the  number  of  valid  digits 
required  by  Eq.  (2.7)  is  roughly  proportional  to  log10(m),  while  that  required  by  Eq.  (2.6)  using  the  Gamma 
function  is  proportional  to  log10(mn  ln(n))  for  the  same  precision  in  the  result  In  the  exhaustive  calculation 
using  a  DEC  System  20  with  single-precision  variables,  we  obtained  a  maximum  error  of  0.2%  when  m=  106 
over  the  range  of  variables  shown  in  Section  2.3. 


2.5  Comments  on  Related  Work 

Formulas  essentially  identical  to  Cardenas’  and  Yao’s  formulas  were  derived  independently  by  Waters  and 
Karayiannis  [WAT  72]  [WAT  75]  [WAT  76].  Waters  summarized  three  related  formulas  in  [WAT  76],  which 
are 


»WATi(m,p,k)  =  m[l  -  (1  — k/n)pl. 

(2.13) 

»WAT2(m-P>k)  =  m  [1  -  (1  -p/n)k],  and 

(2.14) 

'wAT3(m’P’k)  =  ml1  -  nf=10  -  p/(n-i+l))]. 

(2.15) 

Eq.  (2.14)  and  Eq.  (2.15)  arc  identical  to  Eq.  (2.5)  and  Eq.  (2.3),  respectively.  Eq.  (2.13)  was  derived  in 
[WAT  72]  [WAT  75],  as  follows: 
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ERR(m,p,k)  (%) 


TO  TOTAL  NO.  OF  RECORDS 


Figure  2-1:  Error  of  Eq.  (2.7)  as  m  Goes  to  Infinity. 
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Figure  2-3:  Error  of  Eq.  (2.7)  when  k  =  0.65n 
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RHR  =  number  of  distinct  records  hit  /  total  number  of  records  in  the  file 
=  probability  that  any  particular  record  is  hit 
=  k/n. 

1  -  RHR  =  probability  that  any  particular  record  is  not  hit 
(1  -  RHR)P  =  probability  that  any  particular  block  is  not  hiL 
1-(1-RHR)P=  probability  that  any  particular  block  is  hit 

Subsequently,  during  one  of  Waters'  lectures,  Karayiannis  (then  a  student)  suggested  that  Eq.  (2.13)  was 
incorrect,  pointing  out  that  Eq.  (2.13)  gives  an  incorrect  result  where  m  =  1  (correct  result  is  b(m,p,k)=  1  if  k  > 
0).  He  further  suggested  Eq.  (2.14)  as  an  alternative  formula.  I^atcr  Waters  [WAT  76]  announced  that  Eq. 
(2.13)  and  the  above  derivation  were  incorrect  and  instead  suggested  Eq.  (2.15)  as  an  alternative  formula. 

However,  we  note  that  the  derivation  of  Eq.  (2.13)  is  correct  if  we  make  the  independence  assumption  in 
calculating  the  probability  that  any  particular  block  wii!  not  be  hit  More  rigorous  derivation  should  use 
conditional  probability ,  since  the  events  of  each  record’s  being  hit  are  not  mutually  probabilistically 
independent 

We  note  that  if  we  interchange  p  and  k,  Eq.  (2.12)  bears  the  same  relationship  with  Eq.  (2.13)  as  Eq.  (2.3) 
does  with  Eq.  (2.14).  In  this  sense,  Eq.  (2.12)  and  Eq.  (2.13)  are  a  dual  of  Eq.  (2.3)  and  Eq.  (2.14). 


It  was  observed  in  [YAO  77]  that  Eq.  (2.14)  yields  a  good  approximation  when  k  «  n  (k  «  1)  or  p  »  1. 
Hence,  Eq.  (2.13)  will  give  a  good  approximation  when  p  «  n  (m  »  1)  or  k  »  1  by  duality.  This  means  that 
one  formula  will  result  in  a  good  approximation  when  its  counterpart  yields  a  poor  one,  and  vice  versa. 
Therefore,  an  obvious  alternative  approach  to  the  one  presented  in  this  paper  is  to  combine  these  two 
formulas  in  such  a  way  as  to  get  a  good  approximation  over  the  entire  range.  As  an  example,  wc  suggest  here 
the  following  formula: 

bWJ(m,p,k)  =  max  (bWAT1(m,p,k),  bWM1(m,p,k)}  (2.16) 

=  max  {m  [1  -  (l-k/ny*],  m  [1  -  (l-p/n)k]}, 


where  ’max’  represents  the  minimum  of  the  two  arguments.  This  equation  will  be  a  good  approximation, 
since  cither  formula  always  produces  a  value  smaller  than  the  exact  formula.  (This  can  be  easily  understood 
by  examining  the  underlying  assumptions.) 
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2.6  Application 

An  implicit  assumption  made  throughout  the  development  of  all  the  formulas  is  that  a  block  is  accessed  no 
more  than  once.  We  encounter  this  situation  in  practice  when  the  records  selected  are  accessed  in  TID  (tuple 
identifier  or  database  key)  order. 

Two  typical  applications  of  these  formulas  are  in  query  optimization  [Y AO  79]  and  physical-database 
design  [HAM  76]  [WHA  81].  The  formulas  are  used  to  estimate  the  number  of  block  accesses,  which  is  an 
important  measure  of  cost.  In  an  approach  employed  in  [WHA  81],  they  are  also  used  to  estimate  the  number 
of  logical  groups  of  records  selected.  A  logical  group  is  a  set  of  records  grouped  according  to  certain 
criteria —for  example,  common  possession  of  the  same  value  on  a  certain  field.  Close  estimation  of  the 
number  of  logical  groups  selected  is  necessary  in  analyzing  the  interactions  among  relations  in  the  design  of  a 
physical  database.  In  this  application,  we  are  very  likely  to  have  low  grouping  factors  (number  of  records  in  a 
group)  that  correspond  to  the  blocking  factors  of  a  block  (physical  group).  For  example,  we  have  a  grouping 
factor  of  1  when  the  records  are  grouped  according  to  the  values  of  a  key  field. 

Although  Cardenas  formula  (currently  used  in  System  R  [SCH  81])  gives  a  reasonable  approximation  in 
many  cases,  it  is  expccially  prone  to  failure  at  low  blocking  factors  (particularly  when  p  <  10).  Eq.  (2.7)  proves 
to  be  very  useful  in  these  situations. 

2.7  Conclusion 

A  closed  noniterative  formula  for  estimating  the  number  of  block  accesses  was  introduced.  It  improves 
Yao’s  exact  formula  in  the  sense  that  it  significantly  reduces  the  computation  time  by  eliminating  the  iterative 
loop,  while  providing  a  practically  negligible  deviation  (maximum  errors 3.7%)  from  the  exact  formula  over 
the  entire  range  of  variables  involved.  The  computational  error  due  to  the  machine's  limited  precision  has 
been  greatly  reduced  as  compared  with  a  method  using  the  Gamma  function  based  on  Yao's  formula.  It 
significantly  improves  Cardenas'  earlier  formula,  which  has  a  maximum  error  of  e-1  =  36.8%  (at  p  =  1). 
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Appendix  A.  Relationships  between  Relations 

In  this  section,  we  demonstrate  that  the  assumption  that  we  made  in  Section  1.2  excluding  M-to-N 
relationships  from  consideration  for  optimization  is  reasonable. 

Relations  can  have  various  relationships  (not  necessarily  semantically  meaningful  ones)  depending  on  the 
characteristics  of  the  domains  of  the  attributes  that  are  related.  For  example,  if  we  relate  a  key  attribute  (or 
set  of  attributes)  in  relation  Rx  and  a  nonkey  attribute  (or  set  of  attributes)  in  relation  R^,  then  Rx  and  R^  have 
a  1-to-N  relationship  with  respect  to  these  attributes  considered.  Relations  Rx  and  R2  will  have  a  1-to-l 
relationship  if  attributes  considered  in  both  relations  are  key  attributes,  and  an  M-to-N  relationship  if  both 
are  nonkey  attributes. 

In  this  section,  we  shall  show  that  a  relation  scheme  any  of  whose  relation  instance  is  a  join  of  two  relations 
which  has  an  M-to-N  relationship  with  respect  to  a  set  of  attributes  A  has  a  multivalued  dependency 
(MVD)-  assuming  that  the  only  predicate  that  relates  these  two  relations  is  the  one  that  represents  the  join 
on  A. 

Intuitively,  if  a  relation  scheme  R  has  an  MVD  A-»-*B  (and  accordingly  A-*-*R  —  B),  where  A  and  B  are 
sets  of  attributes  in  R,  then  in  a  specific  relation  instance  r  of  R,  given  a  specific  value  of  A,  the  values  of 
R-B  are  completely  replicated  for  every  distinct  value  of  B.  Because  of  this  replication,  sets  of  attributes  B 
and  R  —  B  do  not  bear  much  meaningful  relationship,  and  thus  it  does  not  make  much  sense  to  have  both  sets 
of  attributes  together  in  a  single  relation. 

We  believe,  in  accordance  with  the  above  argument,  that  joining  two  relations  that  have  M-to-N 
relationships  with  respect  to  the  set  of  attributes  on  which  the  join  is  performed  is  relatively  infrequent  In 
Section  1.2,  on  the  basis  of  this  argument  we  excluded  from  consideration  as  prospects  for  optimization  join 
operations  on  relations  that  bear  an  M-to-N  relationship. 


We  have  the  following  theorems: 
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Theorem  1:  If  a  relation  scheme  R  has  an  MVD  A-*-*B,  where  A  and  B  are  sets  of  attributes  of  R,  then 
every  relation  r  for  R  is  a  natural  join  of  projections  of  r  on  the  relation  schemes  Rj  =  A,  R2  =  AUB,  Rj  = 
AU(R  -  B),  respectively,  where  Rj,  R2>  and  Rj  possess  the  relationships  shown  in  Figure  A-l. 
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Figure  A-l:  Relation  Schemes  and  Their  Relationships. 

In  this  figure  —  *  represents  a  1-to-N  relationship  with  respect  to  A. 

Proof:  Rj,  R2,  and  Rj  can  be  obtained  by  two  consecutive  lossless  join  decompositions,  i.e.,  decomposition 
of  R  into  AUB  and  AU(R  — B)and  decomposition  of  AUB  into  A  and  AUB.  These  two  decompositions  are 
lossless,  since  we  have  an  MVD  A-»-*B  [ULL  80].  Thus,  the  overall  join  decomposition  of  R  into  Rj,  R2, 
and  Rj  is  also  lossless.  Therefore,  for  any  relation  r  for  R,  r  =  JOIN3_  jnR  (r). 

To  prove  that  Rj  and  R2  has  a  1-to-N  relationship,  we  note  that  A  in  Rj  is  a  key,  since  it  is  the  only 
attribute  (or  set  of  attributes)  in  R,.  However,  A  in  R2  is  generally  not  a  key.  So  we  have  a  1-to-N 
relationship  from  Rj  to  R2. 

When  A  in  R2  is  a  key,  we  have  a  1-to-l  relationship  between  Rj  and  R2,  which  can  be  considered  as  a 
special  case  of  a  1-to-N  relationship.  Similarly,  Rj  and  Rj  have  a  1-to-N  relationship.  Q.E.D. 

Theorem  2:  A  relation  scheme  R  has  MVDs  A-*-*B  and  C— »-»D  if  any  relation  r  for  R  is  a  natural  join 
of  some  relations  r2,  r2,  and  r3  for  relation  schemes  Rj,  R2,  and  Rj,  respectively,  where  R.,  R2,  and  Rj  have 
the  relationships  shown  in  Figure  A-2. 
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Figure  A-2:  Relation  schemes  and  their  relationships. 

In  that  figure  —  *  represents  a  1-to-N  relationship  with  respect  to  A  on  the  left  side  and  one  with  respect 
to  C  on  the  right  side. 

Proof:  Consider  tuples  t  and  s  with  t[A]  =  s[A]  in  a  relation  r  for  R.  Since  r  is  a  natural  join  of  some 

relations  rr  r2,  and  r3,  respectively,  there  must  exist  tuples  Up  u2  in  r3;  Vp  v2  in  r2;  and  wr  w2  in  r3  such  that 

t{A]  =  ujA]  =  vjAj  and  t(CJ  =  Uj[C]  =  wJC] 
s{A]  =  u2(A]  =  v2[A]  and  s[C]  =  u2[CJ  =  w2[C]. 

Since  t(A]  =  sJA],  we  have  uJA]  =  u2[A],  But  since  Rt  and  R2  have  a  1-to-N  relationship  from  R3  to  R2, 
and  they  are  connected  through  A,  A  must  have  unique  values  in  r,.  Hence  u3  =  u2  and  accordingly  uJC]  = 
u2[C]  =  w2[C]. 

Therefore  r  will  contain  a  tuple  z  where 
z(A]  =  vjA]  =  t[A]  =  s[A] 
z{B]  =  vJB]  =  P] 

4R-AUBJ  =  w2[R-AUB1  =  s(R  — AUBJ. 

Thus  R  has  an  MVD  A-*-*B.  By  a  similar  argument,  R  has  C-+-+D.  Q.E.D. 

Corollary:  Let  relation  schemes  R3  and  R2  have  an  M-to-N  relationship  with  respect  to  a  set  of  attributes 
A.  The  relation  scheme  R  whose  relation  instances  arc  natural  joins  on  A  of  two  relations  r3  for  R3  and  r2  for 
R2  has  MVDs  A— — (Rj  -  A)  and  A-— (Rj-A). 
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Proof:  We  can  consider  a  two-relation  join  of  ^  and  r2  as  a  three- relation  join  of  rr  r2»  and  an  imaginary 
relation  riAr1  U  nAr2.  Then  the  relation  scheme  corresponding  to  this  imaginary  relation  has  1-to-N 
relationships  with  R ^  and  R2,  with  respect  to  A,  as  shown  in  Figure  A-3. 


A 


/  \ 

/  \ 

/  \ 


A  Rj-A  A  R2-A 

R1  I  I  III  I  R2 

Figure  A-3:  Relation  R3  has  1-to-N  Relationships  with  R3  and  R2. 

Thus  relation  scheme  R  has  MVDs  A)  and  A-*-*(R2-B)  from  Theorem  2.  Q.E.D. 
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EQUIVALENT  RESTRICTION  FREQUENCY  OF  A  PARTIAL-JOIN 

Appendix  B.  Equivalent  Restriction  Frequency  of 

a  Partial-Join 

In  Section  1.6,  the  equivalent  restriction  frequency  of  a  partial-join  using  the  join  index  method  was  defined 
as  the  ratio  of  the  gain  in  access  cost  by  having  the  restriction  indexes  in  a  partial-join  to  the  gain  in  access  cost 
that  the  same  restriction  indexes  would  yield  in  the  joint  restriction  with  the  join  index.  We  shall  show  in  this 
section  that  this  equivalent  restriction  frequency  of  a  partial  join  using  the  join  index  method  performed  on 
relation  R2  can  be  calculated,  with  one  exceptional  case,  as  Cf12/Fa,  where  Cf12  is  the  coupling  factor  from 
relation  Rx  to  relation  R2  and  Fa  is  the  selectivity  of  the  join  columns  of  relation  R2. 

By  formulating  the  partial-join  cost  and  the  cost  of  the  joint  restriction  in  both  cases  in  which  the  restriction 
index  is  used  and  in  which  the  restriction  index  is  not  used  (or  does  not  exist),  we  shall  show  that  the  number 
of  block  accesses  saved  in  a  partial-join  is  the  same  as  the  number  of  block  accesses  saved  in  the  joint 
restriction  of  the  join  index  and  the  restriction  index  used  in  the  partial-join  multiplied  by  Cf12/Fa. 

We  have  three  general  cases:  in  Case  1  both  the  join  index  and  the  restriction  index  are  nonclustering;  in 
Case  2  the  join  index  is  nonclustering,  while  the  restriction  index  is  clustering;  in  case  3  the  join  index  is 
clustering,  while  the  restriction  index  is  nonclustering. 

Case  1:  both  the  join  index  and  the  restriction  index  are  nonclustering 

a.  When  the  restriction  index  is  used 

Joint  restriction  cost  =  b(m,p,FaXFXn) 

Partial-join  cost  =  (Cf12/Fa)b(m,p,FaXFXn) 

In  a  joint  restriction,  the  number  of  records  selected  is  FaXF.Xn.  We  assume  that  these  records  are  evenly 
spread  and  are  accessed  in  TID  order.  Thus  we  get  b(m,p,FaXFiXn)  block  accesses.  In  a  partial-join,  we  are 
following  the  join  index  in  the  order  of  join  column  value,  and  FflXFiXn  records  are  accessed  for  a  distinct 
join  column  value.  Since  these  records  are  spread  over  the  entire  file  and  are  accessed  in  TID  order,  wc  get 
b(m,p,FaXFiXn)  block  accesses.  This  procedure  is  repeated  for  every  distinct  join  column  value  selected  by 


SEPARABILITY  AS  A  PHYSICAL  DATABASE  DESIGN  METHODOLOGY 


the  coupling  effect  and  the  join  selectivity  (i.e.,  according  to  the  coupling  factor).  The  total  number  of  distinct 
join  column  values  are  1/Fa.  Therefore,  as  the  partial-join  cost,  we  have  (Cf12/P'a)  b(m,p,FaXFjXn). 

b.  When  the'  restriction  index  is  not  used  (or  does  not  exist) 

Joint  restriction  cost  =  b(m,p,FaXn) 

Partial-join  cost  =  (CfJ2/Fa)  b(m,p,FaXn) 

An  analysis  applies  that  is  the  same  as  above  except  that  the  restriction  index  is  not  used.  Thus,  we  have 

F  Xn  selected  records  instead  of  F  XF.Xn. 
a  a  i 

Case  2:  the  join  index  is  nonclustering  while  the  restriction  index  is  clustering 
There  are  two  cases  to  be  considered  separately:  when  F^l  and  when  F.  <  1. 

1.  When  F.>1 

a.  When  the  restriction  index  is  used 

Joint  restriction  cost  =  b^Xm,  p,  F^Xn) 

Partial-join  cost  -  (Cfn/Fa)  b(FXtn,  p,  F^Xn). 

This  case  is  almost  identical  to  Case  1,  except  that  the  restriction  index  is  clustering  and  the  range  within 
which  the  selected  records  can  be  found  is  limited  to  FXm  blocks  instead  of  m  (the  number  of  blocks  of  the 
entire  file).  To  use  b  function  it  is  required  that  FjXm^l. 

b.  When  the  restriction  index  is  not  used  (or  docs  not  exist) 

Joint  restriction  cost  =  b(m,p,FaXn) 

Partial-join  cost  =  (Cf12/Fa)  b(m,p,FaXn) 

This  case  is  exactly  the  same  as  Case  1-b. 

# 

2.  When  F,Xm<  1 

a.  When  the  restriction  index  is  used 
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Joint  restriction  cost  =  FX  bfl/Fj,  FjXn,  F#Xn) 

Partial-join  cost  =  (Cf^/F^XFjXbfl/Fj.FjXn^Xn). 

Since  FjXm  <  1  and  the  restriction  index  is  clustering,  all  records  selected  according  to  the  restriction  index 

will  be  confined  in  an  area  smaller  than  1  block  (let  us  call  this  a  selected  area).  Let  us  assume  that  this 

selected  area  resides  within  a  physical  block  (i.e.,  we  ignore  the  case  in  which  this  selected  area  resides  on  the 

border  of  two  blocks).  If  we  assume  that  the  file  is  divided  into  logical  blocks  of  the  same  size  as  this  selected 

area,  the  probability  that  this  selected  area  will  be  hit  by  a  joint  restriction  is 
(l/(l/F))b(l/F,FXn,FaXn). 

This  is  also  the  probability  that  the  physical  block  containing  the  selected  area  will  be  hit  (note  that  there  are 
1/Fi  logical  blocks  in  the  file).  This  is  also  the  number  of  physical  blocks  to  be  hit  by  the  joint  restriction, 
since  the  physical  block  containing  the  selected  area  is  the  only  one  that  can  possibly  be  accessed. 

In  a  partial-join,  the  same  analysis  is  valid  for  each  distinct  join  column  value,  assuming  that  the  same  block 
must  be  fetched  again  if  a  repeated  forward  scan  inside  this  block  is  to  be  performed.  Thus  the  partial-join 
cost  is  the  product  of  (Cf12/F#)  and  the  joint  restriction  cosl 

b.  When  the  restriction  index  is  not  used 
Joint  restriction  cost  =  b(m,p,FaXn) 

Partial-join  cost  =  (Cf12/Fg)  b(m,p,FaXn) 

This  case  is  exactly  the  same  as  Case  1-b. 

Case  3:  the  join  index  is  clustering,  while  the  restriction  index  is  nonclustering 

1.  When  F#Xm>l 

a.  When  the  restriction  index  is  used 

Joint  restriction  cost  =  b(FaXm,  p,  FaXFjXn) 

Partial-join  cost  =  (Cf12/Fa)  b(FaXm,  p,  FaXF.Xn). 

An  analysis  similar  to  Case  2-1-a  applies,  except  that  the  range  of  the  selected  records  is  limited  to  FaXm 
blocks  instead  of  F.Xm. 
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b.  When  the  restriction  index  is  not  used 

Joint  restriction  cost  =  F  Xm 
a 

Partial-join  cost  =  (Cf12/Fa)XFaXm  =  Cf^Xm. 

Since  the  join  index  is  clustering,  the  number  of  blocks  accessed  is  proportional  to  the  number  of  records 
selected. 

2.  When  F  Xm<  1 

a 

a.  When  the  restriction  index  is  used 

Joint  restriction  cost  =  (1/(1/Fa))  b(l/Fa,  FgXn,  F.Xn) 

Partial-join  cost  =  b(m,  l/(Fam),  Cf12Xb(l/Fa,  FaXn,  F.Xn)). 

The  joint  restriction  cost  can  be  obtained  by  a  similar  analysis  used  in  Case  2-2-a,  except  that  the  roles  of  Ffl 
and  F  are  interchanged. 

In  the  partial-join,  the  entire  file  is  divided  into  1/R  logical  blocks,  each  of  which  contains  F  Xn  records. 

a  a 

According  to  the  restriction  index,  RXn  records  are  selected;  the  number  of  logical  blocks  selected  by  this 
restriction  is  b(l/Fa,  FaXn,  F4Xn). 

The  coupling  factor  Cf12  determines  how  many  distinct  join  column  values  are  actually  selected.  Since  one 
logical  block  corresponds  to  one  distinct  join  column  value,  the  number  of  logical  blocks  selected  according  to 
the  coupling  factor  and  the  selectivity  of  the  restriction  index  is  Cf12Xb(l/Fa,  FaXn,  F.Xn). 

To  calculate  the  number  of  physical  blocks  hit,  let  us  assume  that  the  entire  file  consists  of  m  blocks,  each 
of  which  contains  l/(Fam)  logical  blocks.  Since  CfJ2Xb(l/Fa,  FaXn,  F;Xn)  logical  blocks  are  selected,  the 
number  of  physical  blocks  that  will  be  hit  is  b(m,  1/(F  m),  Cf.,Xb(l/F  ,  F  Xn',  F.Xn)). 

a  JLZ  3  3  1 

b.  When  restriction  index  is  not  used  (or  does  not  exist) 

Joint  restriction  cost  =  1 

Partial-join  cost  =  b(m,l/(Fam),Cf12/Fa) 
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This  can  be  easily  derived  from  Case  3-2-b  by  setting  F  to  1. 

We  have  seen,  in  all  situations  except  Case  3-2,  that  the  partial-join  cost  is  equivalent  to  Cf12/Ffl  times  the 
joint  restriction  cost.  Accordingly,  the  cost  saved  by  having  the  restriction  index  in  a  partial-join  is  Cf^/F^ 
times  the  cost  saved  by  having  the  restriction  index  in  the  joint  restriction. 

Case  3-2  is  the  only  case  in  which  the  equivalent  restriction  frequency  of  a  partial-join  using  the  join  index 
method  cannot  be  represented  as  Cf12/Fa.  The  reason  is  that,  in  a  partial-join,  the  logical  blocks  are  accessed 
in  a  serial  order,  and  thus  several  logical  blocks  may  cause  only  one  block  access.  In  the  case  of  joint 
restriction,  we  need  one  block  access  in  any  case  if  at  least  one  record  is  selected. 

The  derivations  of  the  formulas  were  introduced  to  show  how  we  can  formulate  cost  formulas  with  the  b 
function,  as  well  as  to  show  that,  in  most  cases,  equivalent  restriction  frequency  has  a  simple  form,  Gf12/Fa. 

While  the  detailed  form  of  cost  formulas  depend  on  the  specific  cost  models,  we  believe  that  the  same 
principle  we  used  in  the  derivation  can  be  easily  applied  to  any  given  model. 


i 
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Appendix  C.  Computational  Errors 

C.1  Comparison  of  Computational  Errors 

In  this  appendix  wc  develop  the  prediction  of  the  computational  errors  which  occur  in  the  estimation  of 
block  accesses  discussed  in  Section  2.4.  These  computational  errors  occur  due  to  the  limited  precision  of  the 
computing  system  used. 

Theorem  1:  Calculation  of  Eq.  (2.6)  to  d  digits  of  precision  with  a  possible  error  of  ±1  in  the  least 
significant  digit  (LSD)  requires  at  least  log10(mn  log(n))  +  d  valid  digits  with  a  possible  error  of  ±1  in  the 
LSD. 

Proof:  We  shall  use  a  pseudo  equality  symbol  =  throughout  this  proof  and  the  proof  of  Theorem  2,  ighoring 
the  deviation  from  equality  whenever  it  neither  affects  the  logical  flow  of  the  proof  nor  changes  the  numerical 
result  significantly. 

By  Stirling’s  approximation  (KNU-a  73J, 

T(n  +  1)=  V  2wn(n/e)",  and 

ln(r(n  +  l))=ln(72w)+0.5  ln(n)+n(ln(n)-l) 

=  n  ln(n), 

since  we  are  considering  relatively  large  n’s. 

From  Eq.  (2.6), 

b(m,p,k)  =  1  -  exp[LG AM(n  -  p) + LG  AM(n  -  k)  -  LG  AM(n  -  p  -  k)  -  LG AM(n)]  (2.17) 

=  -  LG  AM(n  -  p)  -  LG  AM(n  -  k) + LG  AM(n  -  p  -  k)+  LG  AM(n). 

Let  us  consider  the  case  in  which  k  =  1.  At  this  k  value,  all  four  terms  in  Eq.  (2.17)  are  close  to  n  ln(n),  the 

result  is  the  smallest  possible,  and  we  shall  get  the  maximum  error.  If  we  assume  that  evaluation  of  Eq.  (2.17) 

causes  the  error  of  ±  1  in  the  LSD,  then  the  error  of  the  result  will  be 
10“ *  X  nln(n), 

where  x  is  the  number  of  significant  digits. 

The  exact  value  of  the  result  of  Eq.  (2.17)  must  be  1/m,  since  only  one  block  will  be  hit  Therefore,  the 
relative  error  caused  by  the  computation  with  x  significant  digits  will  be 

1  fHKHDlNQ  PJJM  BUMC-NOT  FILMS 
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(10_x  X  n  ln(n))/(l/m)=(mn  ln(n))  X  10_x.  (2.18) 

If  we  require  this  to  have  an  error  of  less  than  10~d,  so  that  we  have  d  digits  of  precision  in  the  result  with  a 

possible  error  of  ±  1  in  the  LSD,  Hq.  (2.18)  must  be  less  than  10-d.  Therefore, 
x  >  log10(mn  ln(n))  +  d.  Q.E.D. 

Theorem  2:  x  >  (log  1Q  m)+d  +  log1Q(d)+l  valid  digits  with  a  possible  error  of  ±1  in  the  LSD  are 
sufficient  in  the  calculation  of  Eq.  (2.7)  to  d  digits  of  precision. 

Proof:  The  major  cause  of  the  error  is  in  the  calculation  of  1  —  1/m  as  m  gets  larger,  since  it  requires  as 
many  digits  as  log10  m.  We  shall  use  the  equality  (1  -  l/rn)m  =  e _  1  throughout,  assuming  that  m  is  sufficiently 
large.  For  convenience  let  us  consider  only  the  first  term  of  Eq.  (2.7),  since  the  other  terms  behave  similarly 
and  their  absolute  values  are  always  less  than  (1  -  l/m)k. 

Let  us  divide  the  values  of  k  into  3  ranges:  k  <  0.1m,  k  >  ln(10)XdXm,  and  0.1m  <  k  <  ln(10)XdXm. 

(1)  k  <  O.lni 

From  a  Taylor  expansion  we  have 

(1-  l/m)k  =  l-k/m  +  k(k-  l)/2  X  (1/m)2  ...  =  1-k/m,  and  thus 
1— (1  —  l/m)k  =  k/m. 

In  the  calculation  of(l  —  1/m)  we  have  an  error  of  10  ”x,  so  that,  as  a  result  of  computation,  we  get 
(l-l/m+10_x)k=  l-k(l/m-  1(TX). 

(For  convenience  let  us  consider  only  a  positive  error.  Negative  errors  can  be  treated  similarly.)  Accordingly, 
the  error  of  the  overall  calculation  will  be 

(k(l/m-  10_x)-k/m)/(k/m)=  -  10-x  X  m. 

Thus,  we  get  a  precision  of  d  digits  in  the  result  if  and  only  if 

10_x  X  m  <  10-d,  or 
x  >  (log]0  m)+d. 

(2)  k  >  ln(10)  X  d  X  m 
huhiscascO<(l-l/m)k<  10-d.  Hence, 
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1  >  l-(l-l/m)k  >  1-  10-d2>0.9, 
assuming  d  >  1.  However,  actual  computation  may  yield 
l-(l-l/m+10-x)k. 

Since 

*  >  (log10m)+d+l, 
we  have 

10-x  <(l/m)10~(d+1). 

Since 

(l-l/m+10-(d+1)/m)k 
=  (l-(l-10~(d+1>)/m)k 

<  (1  -(1  -  10_(d^  D)/m)ln(lO)XdXm 

_  10-(i-io_(d+1>)Xd  A  10-d 

assuming  d>l,  the  relative  error,  ((l-l/m)k  -  (l-l/m+10-x)k)/0.9,  cannot  be  greater  than 
(l/0.9X10-d)=10_d.  Thus  we  have  a  precision  of  d  digits  in  the  result 

(3)  0.1m<k<ln(10)XdXm 

We  have 

ln[((l  -  l/m)+  10~x)/(l  -  l/m)kJ 
=  k(ln(  1  -  1/m  + 10  "x)-ln(  1-1/m)) 

=  k((l-l/m  +  10-x)-(l-l/m)) 

=  k  X  10'x. 

Accordingly, 

((1  -  l/m  + 10_  x)k  -  (1  -  l/m)k)/(l  -  l/m)k 
=  exp(kX10"x)  -  1. 

a)m<k<(ln  10)XdXm 

The  relative  error  will  be 

((1  -  l/m+10-x)k  -(1  -  l/m)k)/(l-(l  -  l/m)k) 

<((l-l/m+10-x)k-(l-l/m)k))/(l-l/m)k 
=  cxp(kX10"x)  -  1 

<  exp((k/md)X10-(d+1>)  -  1 
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<cxp(ln(10)X10-(d+1>)  -  1 

<ln(10)XKT(d+1> 

.  =  0.23X10~d. 

llius,  we  have  a  precision  of  d  digits, 
b)  0.1  m<k<m 

We  have 

(1  —  l/m)k  i  1  -  k/m  <  0.9. 

Hence,  the  relative  error  will  be 

((1  -  l/m  +  10-x)k  — (1  —  l/m)k)/(l  -(1  -  l/m)k) 

<  (l/0.1X(l-l/m+]0~x)k-(l-l/m)k) 

<  10((1-  l/m+  10‘x)k-(l  -  l/m)k)/(l~  l/m)k 
=  10(exp(kX10_x)  -  1) 

<  10(cxp((k/m)X10'(d  +  1))  -  1) 

<  10((k/m)X10“(d+1)) 

<  10X10“(d+1) 

=  10  "d. 

This  shows  that  we  have  a  precision  of  d  digits.  Q.E.D. 

Corollary:  Eq.  (2.7)  requires  at  least  x  >  (log10  m)+ d  valid  digits  to  get  d  digits  of  precision  in  the  result 

Proof:  This  follows  from  die  case  (1)  of  Theorem  2.  Q.E.D. 

Applying  Theorem  2  and  its  corollary,  the  actual  requirement  will  be 
(log10m)+d  <  x  <  (log  1Q  m)+d  +  log10(d)+l. 

Example  1:  Let  us  calculate  the  number  of  valid  digits  required  by  the  evaluation  of  Fxj.  (2.7)  and  Fq.  (2.6), 
respectively,  when  m  =  106,  p  =  10,  n  =  107,  and  wc  need  a  precision  of  2  digits  in  the  result 

(a)  For  Eq.  (2.7), 

log10(106)+2+log10(2)+ 1=9.3, 
log10(106)+2  =  8,  and 
8  <  x  <  9.3. 


/ 
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(b)  For  Eq.  (2.6), 

x=log10(106  X  107  X  ln(107))  +  2 
=  16.3. 

We  note  that  Eq.  (2.6)  requires  roughly  twice  as  many  valid  digits  as  does  Eq.  (2.7).  □ 

In  the  exhaustive  calculation  we  made  over  the  range  specified  in  Section  2.3,  the  maximum  error  (0.2%) 
occurred  at  m = 106,  p  =  1,  and  k  «  m  (i.e.  k~  1),  which  actually  corresponds  to  the  lower  bound  given  in  the 
corollary. 

Example  2:  The  error  of  0.2%  is  equivalent  to  a  precision  of  2  digits  according  to  our  definition,  since  0.998 

compared  with  1.0  clearly  has  an  error  exceeding  1  in  the  third  digit,  and  the  first  and  second,  digits  are  the 

only  valid  digits  with  possible  error  of  ±1  in  the  LSD.  Thus,  the  number  of  valid  digits  x  of  the  computer 

required  by  Eq.  (2.7))  when  m  =  106  will  be 
8  <  x  <  9.3 

The  DECSYSTEM-20  has  2  ~ 27  of  resolution,  approximately  corresponding  to  8  valid  digits,  which  confirms 
our  result  □ 

C.2  Computational  Error  in  an  Extended  Range 

The  maximum  computational  error  when  the  number  of  blocks  m  is  extended  to  107  is  4.3%;  it  occurs  at 
k = 1  for  all  values  of  p. 

We  assumed  throughout  that  m  has  only  integer  values.  However,  computer  calculation  performed  over  all 
combinations  of  the  following  range  shows  that  the  maximum  deviation  of  Eq.  (2.7)  from  the  exact  formula  is 
3.7%,  even  for  the  real  values  of  m. 

•  1.1  <  p  <  3.9  with  increments  of  0.1, 

•  1  <  p  <  10  where  p  is  an  integer, 

•  1.1  <  m  <  3.9  with  increments  of  0.1. 
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